From: Henry Baker <hbaker1@pipeline.com>
Is there a better encoding of Zipf alphabets such that the encoding of the symbol of relative probability 1/i is _independent_ of the alphabet size and the encoding tends towards 100% efficiency asymptotically?
Henry and others have shown a couple codes that meet this requirement.
From: franktaw@netscape.net
I'm not sure exactly how you define efficiency,
You divide the entropy by the, er, weighted bits per symbol of the code. If P(s) is the probability of symbol s, then the entropy (in bits) of symbol s is -log2(P(s)), and the entropy for the whole alphabet is the sum of -log2(P(s)) * P(s) over the whole alphabet. If the code uses bits(s) bits to encode symbol s, then the weighted bits per symbol for the alphabet is the sum of bits(s) * P(s) over the whole alphabet. The efficiency of the code is the entropy divided by that weighted (average) bits per symbol. (Henry, that is the way you're calculating it, right?)
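Here is that calculation as a short Python sketch (the function names and the 8-bit fixed-length example code are mine, just for illustration):

    import math

    def zipf_probs(n):
        # P(s) proportional to 1/s for s = 1..n, normalized by the harmonic number H_n.
        h = sum(1.0 / i for i in range(1, n + 1))
        return [1.0 / (i * h) for i in range(1, n + 1)]

    def efficiency(probs, bits):
        # Entropy of the alphabet divided by the weighted (average) bits per symbol.
        entropy = sum(-p * math.log2(p) for p in probs)
        avg_bits = sum(p * b for p, b in zip(probs, bits))
        return entropy / avg_bits

    # Example: a plain 8-bit fixed-length code for a 256-symbol Zipf alphabet.
    # The entropy works out to roughly 6.2 bits per symbol, so this prints about 0.78.
    print(efficiency(zipf_probs(256), [8] * 256))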
but I'm pretty sure that this is impossible. If the first k values use half the available code space, then as n goes to infinity this half of the code space will be used for values carrying an asymptotically vanishing share (O(1/log(n))) of the probability distribution. I don't see how this can be anything but inefficient.
I wanted to point out where this intuition went wrong. The code must be inefficient, but the inefficiency can approach zero for larger and larger alphabets, as a ratio of the size of the code word to what it would ideally be.

"Half of the code space" is the misleading idea. Half of the code space means all the words that start with (say) a one bit. So if half the code space is wasted, that means there is one extra bit on every word. If only we could waste just half the code space! In fact, as you use larger and larger alphabets, more and more code space is wasted on small numbers with less and less probability. But wasted bits are just the log of the wasted space. For the efficiency to approach 100%, the wasted bits just have to grow more slowly than the ideal size of the word. That's true if they grow like log2(size of the word), which is like log2(log2(s)). This is true of a couple codes that have been discussed here.

A code where this property is easy to see encodes each number s by:

    b zero bits, where b is the number of bits in the number of bits in s
    b bits encoding n = number of bits in s
    n bits encoding s

So,

    0        => 0      1      0    (not used for Zipf distributions)
    1        => 0      1      1
    2        => 00     10     10
    3        => 00     10     11
    4        => 00     11     100
    256      => 0000   1001   100000000
    65535    => 00000  10000  1111111111111111
    2^32 - 1 => 000000 100000 11111111111111111111111111111111

This code is awful, but (e.g.) the efficiency for a 32-bit number is at worst 32 / (6 + 6 + 32) = 73%. You can see that the first two columns grow like the log of the size of the third column, so this meets Henry's requirement.

For a kinda large number, take the bits on my hard disk as representing a single integer. 80 GB = 80 x 2^33 bits < 2^40 bits, so this would be represented by 40 zero bits, 40 size bits, and ~80 x 2^33 data bits. That is 80 wasted bits in 80 GB of data, or 99.9999999884% efficiency. My whole disk packed efficiently into 1/2^80, one millionth of one millionth of one millionth of one millionth of the code space.

--Steve
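For concreteness, here is a short Python sketch of the encoder described above (the function names are mine, and 0 is treated as a 1-bit number, matching the table):

    def encode(s):
        # b zero bits, then n (the bit length of s) written in b bits,
        # then s itself written in n bits, where b is the bit length of n.
        n = max(s.bit_length(), 1)   # treat 0 as a 1-bit number, as in the table
        b = n.bit_length()
        return "0" * b + format(n, "b") + format(s, "b").zfill(n)

    def worst_case_efficiency(s):
        # Ideal length (the bare binary length of s) over the actual code length.
        n = max(s.bit_length(), 1)
        return n / len(encode(s))

    # Reproduces the table above, e.g. encode(4) == "0011100" ("00 11 100").
    for s in (0, 1, 2, 3, 4, 256, 65535, 2**32 - 1):
        print(s, encode(s))
    print(worst_case_efficiency(2**32 - 1))   # 32 / 44, the "at worst 73%" figure

Decoding works because the run of b zeros is always terminated by the leading 1 bit of n, so each code word is self-delimiting.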