At 06:55 PM 6/24/2006, Steve Witham wrote:
From: Henry Baker <hbaker1@pipeline.com>
Is there a better encoding of Zipf alphabets such that the encoding of the symbol of relative probability 1/i is _independent_ of the alphabet size and the encoding tends towards 100% efficiency asymptotically?
Henry and others have shown a couple codes that meet this requirement.
I now know that a code that is independent of alphabet size is called a "universal" code. Approx. 100% efficiency is possible only if the bulk of the encoding is in straight binary (or equivalent).
From: franktaw@netscape.net
I'm not sure exactly how you define efficiency,
You divide the entropy by the, er, weighted bits per symbol of the code. If P(s) is the probability of symbol s, then the entropy (in bits) of symbol s is -log2(P(s)), and the entropy for the whole alphabet is the sum of -log2( P(s) ) * P( s ) over the whole alphabet. If the code uses bits( s ) to encode symbol s, then the weighted bits per symbol for the alphabet is the sum of bits( s ) * P( s ) over the whole alphabet.
The efficiency of the code is the entropy divided by the average bits per symbol. (Henry, that is the way you're calculating it, right?)
Yes, that is also my definition of "efficiency".
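For concreteness, here is a small Python sketch of that calculation (the function names and the choice of Python are mine): it builds the normalized Zipf probabilities for an n-symbol alphabet and divides the entropy by the weighted bits per symbol of any code whose length function you pass in.

  from math import ceil, log2

  def zipf_probs(n):
      # Normalized Zipf probabilities P(i) proportional to 1/i, i = 1..n.
      w = [1.0 / i for i in range(1, n + 1)]
      total = sum(w)
      return [x / total for x in w]

  def efficiency(n, bits):
      # Entropy divided by the weighted bits per symbol of a code whose
      # length on symbol i is bits(i).
      p = zipf_probs(n)
      entropy = -sum(pi * log2(pi) for pi in p)
      avg_bits = sum(pi * bits(i + 1) for i, pi in enumerate(p))
      return entropy / avg_bits

  # Example: plain fixed-length binary over a 1000-symbol Zipf alphabet.
  n = 1000
  print(efficiency(n, lambda i: ceil(log2(n))))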
but I'm pretty sure that this is impossible. If the first k values use half the available code space, then as n goes to infinity this half of the code space will be used for values carrying an asymptotically vanishing fraction (O(1/log(n))) of the probability distribution. I don't see how this can be anything but inefficient.
I wanted to point out where this intuition goes wrong. The code must be inefficient, but the inefficiency (the extra bits in a code word, taken as a fraction of its ideal length) can approach zero for larger and larger alphabets.
"Half of the code space" is the misleading idea. Half of the code space means all the words that start with (say) a one bit. So, if half the code space is wasted, that means that there is one extra bit on every word. If only we could waste just half the code space! In fact as you use larger and larger alphabets, more and more code space is wasted on small numbers with less and less probability. But wasted bits are just the log of the wasted space.
For the efficiency to approach 100%, the wasted bits just have to grow more slowly than the ideal size of the word. That's true if they grow like log2( size of the word ), which is like log2( log2( s ) ). This is true of a couple of codes that have been discussed here.
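A quick numerical illustration of that claim (a rough back-of-the-envelope sketch of my own, assuming the waste amounts to roughly two fields of log2( log2( s ) ) bits each, as in the code described next):

  from math import log2

  # Overhead of ~2*log2(log2(s)) bits shrinks relative to the ideal
  # length of ~log2(s) bits, so the efficiency tends toward 1.
  for k in (8, 16, 32, 64, 256, 1024):
      ideal = k                  # a k-bit number needs ~k ideal bits
      waste = 2 * log2(k)        # two log-log-sized fields
      print(k, round(ideal / (ideal + waste), 3))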
A code where this property is easy to see encodes each number s by:

  b zero bits, where b is the number of bits in the number of bits in s;
  b bits encoding n = the number of bits in s;
  n bits encoding s.
So,

  0        => 0      1      0      (not used for Zipf distributions)
  1        => 0      1      1
  2        => 00     10     10
  3        => 00     10     11
  4        => 00     11     100
  256      => 0000   1001   100000000
  65535    => 00000  10000  1111111111111111
  2^32 - 1 => 000000 100000 11111111111111111111111111111111
This code is awful, but (e.g.) the efficiency for a 32 bit number is at worst 32 / ( 6 + 6 + 32 ) = 73%.
You can see that the first two columns grow like the log of the size of the third column, so this meets Henry's requirement.
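Here is a short Python rendering of that code (the helper names are mine); it reproduces the sample code words above and the worst-case 32 / ( 6 + 6 + 32 ) figure:

  def bits_in(x):
      # Number of bits in the binary representation of x (at least 1).
      return max(1, x.bit_length())

  def encode(s):
      # b zero bits, then b bits giving n = number of bits in s,
      # then the n bits of s itself.
      n = bits_in(s)
      b = bits_in(n)
      return "0" * b + format(n, "b") + format(s, "b")

  for s in (0, 1, 2, 3, 4, 256, 65535, 2**32 - 1):
      print(s, encode(s))

  print(32 / len(encode(2**32 - 1)))   # 32 / (6 + 6 + 32), about 0.73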
For a kinda large number, take the bits on my hard disk as representing a single integer. 80 GB = 80 x 2^33 bits < 2^40 bits, so this would be represented by 40 zero bits, 40 size bits, and ~80 x 2^33 data bits. 80 wasted bits in 80 GB of data, 99.9999999884% efficiency. My whole disk packed efficiently into 1/2^80, one millionth of one millionth of one millionth of one millionth of the code space.
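(For anyone who wants to check the arithmetic, in Python:)

  data_bits = 80 * 2**33                       # ~80 GB of data, as bits
  overhead = 40 + 40                           # 40 zero bits + 40 size bits
  print(data_bits / (data_bits + overhead))    # ~0.999999999884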
--Steve
I did some experiments with yet another encoding, which I called "split" encoding, which is _not_ universal. The idea is to look at the entire non-normalized Zipf distribution 1/i (i=1,n) and then find the location in the middle of this distribution where the weight on one side approximates the weight on the other. Since this distribution is approximated by the curve 1/x (1<=x<=n), and since the area under this curve from 1 to x is ln(x), we quickly see that the location in the "middle" is sqrt(n), because ln(sqrt(n)) = ln(n)/2 -- i.e., the area to the left of sqrt(n) (1<=x<=sqrt(n)) equals the area to the right of sqrt(n) (sqrt(n)<=x<=n). We now assign a high order bit of "0" for i (i=1,sqrt(n)) and a high order bit of "1" for i (i=sqrt(n),n), and recurse on the two sub-segments (1 to sqrt(n) and sqrt(n) to n). (Of course, we can do a little bit better by breaking the sequence up based on the actual harmonic sum rather than simply breaking it at sqrt(n).)

This "split" encoding isn't as good as Huffman, because the average subtree doesn't split as evenly as often. However, "split" does have the property that _the encoding preserves the ordering_ of the symbols, whereas Huffman does not. Huffman loses its ordering the moment it "bubbles up" a subtree; it achieves its efficiency by _re-ordering_ the symbols to get the most even splits. Since this re-ordering is highly dependent on the exact number of symbols, Huffman can never be "universal". The "split" encoding also demonstrates the problem with trying to come up with a "universal" encoding for the Zipf distribution: approximately half of the cdf is to the right of sqrt(n), and since sqrt(n) increases with increasing n, the encoding is necessarily non-universal.

I also instrumented Huffman on the Zipf distribution to try to better understand how evenly it splits the subtrees. Interestingly, no matter how large I made n, Huffman on Zipf of n symbols always had at least one subtree where the weights of the two branches differed by more than 10%. Of course, the efficiency of Huffman doesn't depend upon the largest imbalance, but upon the average imbalance. What is the smallest n for which the Huffman encoding of the Zipf distribution of n symbols has no imbalance exceeding 10%? I conjecture that this could be an exceedingly large number, because the penalty for such an imbalance becomes vanishingly small with increasing n.
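For anyone who wants to experiment, here is a rough Python sketch of the "split" idea as described above (the function names and the exact tie-breaking are my own, and it splits on the actual harmonic weights rather than at sqrt(n)). It builds the order-preserving prefix code by recursively dividing the symbol range where the 1/i weight is most nearly balanced, then compares the average code length with the entropy.

  from math import log2

  def zipf_weights(n):
      # Unnormalized Zipf weights 1/i for i = 1..n.
      return [1.0 / i for i in range(1, n + 1)]

  def split_code(n):
      # Order-preserving "split" code: recursively divide the symbol range
      # where the total 1/i weight is most nearly balanced.
      w = zipf_weights(n)
      codes = [""] * n

      def recurse(lo, hi, prefix):
          if lo == hi:
              codes[lo] = prefix
              return
          total = sum(w[lo:hi + 1])
          acc, best_mid, best_gap = 0.0, lo, float("inf")
          for mid in range(lo, hi):          # left half is lo..mid
              acc += w[mid]
              gap = abs(acc - total / 2)
              if gap < best_gap:
                  best_mid, best_gap = mid, gap
          recurse(lo, best_mid, prefix + "0")
          recurse(best_mid + 1, hi, prefix + "1")

      recurse(0, n - 1, "")
      return codes

  def split_efficiency(n):
      w = zipf_weights(n)
      total = sum(w)
      p = [x / total for x in w]
      codes = split_code(n)
      entropy = -sum(pi * log2(pi) for pi in p)
      avg_bits = sum(pi * len(c) for pi, c in zip(p, codes))
      return entropy / avg_bits

  for n in (16, 256, 1024):
      print(n, round(split_efficiency(n), 4))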