At 06:55 PM 6/24/2006, Steve Witham wrote:
From: Henry Baker <hbaker1@pipeline.com>
Is there a better encoding of Zipf alphabets such that the encoding of the symbol of relative probability 1/i is _independent_ of the alphabet size and the encoding tends towards 100% efficiency asymptotically?
Henry and others have shown a couple codes that meet this requirement.
I now know that a code that is independent of alphabet size is called a "universal" code. Approx. 100% efficiency is possible only if the bulk of the encoding is in straight binary (or equivalent).
From: franktaw@netscape.net
I'm not sure exactly how you define efficiency,
You divide the entropy by the, er, weighted bits per symbol of the code. If P(s) is the probability of symbol s, then the entropy (in bits) of symbol s is -log2(P(s)), and the entropy for the whole alphabet is the sum of -log2( P(s) ) * P( s ) over the whole alphabet. If the code uses bits( s ) to encode symbol s, then the weighted bits per symbol for the alphabet is the sum of bits( s ) * P( s ) over the whole alphabet.
The efficiency of the code is the entropy divided by the average bits per symbol. (Henry, that is the way you're calculating it, right?)
Yes, that is also my definition of "efficiency".
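For concreteness, here is a small Python sketch of that calculation (the function names and the choice of Python are mine): it builds the normalized Zipf probabilities for an n-symbol alphabet and divides the entropy by the weighted bits per symbol of any code whose length function you pass in.

  from math import ceil, log2

  def zipf_probs(n):
      # Normalized Zipf probabilities P(i) proportional to 1/i, i = 1..n.
      w = [1.0 / i for i in range(1, n + 1)]
      total = sum(w)
      return [x / total for x in w]

  def efficiency(n, bits):
      # Entropy divided by the weighted bits per symbol of a code whose
      # length on symbol i is bits(i).
      p = zipf_probs(n)
      entropy = -sum(pi * log2(pi) for pi in p)
      avg_bits = sum(pi * bits(i + 1) for i, pi in enumerate(p))
      return entropy / avg_bits

  # Example: plain fixed-length binary over a 1000-symbol Zipf alphabet.
  n = 1000
  print(efficiency(n, lambda i: ceil(log2(n))))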
but I'm pretty sure that this is impossible. If the first k values use half the available code space, then as n goes to infinity this half of the code space will be used for values carrying an asymptotically vanishing fraction (O(1/log(n))) of the probability distribution. I don't see how this can be anything but inefficient.
I wanted to point out where this intuition goes wrong. The code must be inefficient, but the inefficiency (the extra bits in a code word, taken as a fraction of its ideal length) can approach zero for larger and larger alphabets.
"Half of the code space" is the misleading idea. Half of the code space means all the words that start with (say) a one bit. So, if half the code space is wasted, that means that there is one extra bit on every word. If only we could waste just half the code space! In fact as you use larger and larger alphabets, more and more code space is wasted on small numbers with less and less probability. But wasted bits are just the log of the wasted space.
For the efficiency to approach 100%, the wasted bits just have to grow more slowly than the ideal size of the word. That's true if they grow like log2( size of the word ), which is like log2( log2( s ) ). This is true of a couple of codes that have been discussed here.
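A quick numerical illustration of that claim (a rough back-of-the-envelope sketch of my own, assuming the waste amounts to roughly two fields of log2( log2( s ) ) bits each, as in the code described next):

  from math import log2

  # Overhead of ~2*log2(log2(s)) bits shrinks relative to the ideal
  # length of ~log2(s) bits, so the efficiency tends toward 1.
  for k in (8, 16, 32, 64, 256, 1024):
      ideal = k                  # a k-bit number needs ~k ideal bits
      waste = 2 * log2(k)        # two log-log-sized fields
      print(k, round(ideal / (ideal + waste), 3))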
A code where this property is easy to see encodes each number s by:

  b zero bits, where b is the number of bits in the number of bits in s;
  b bits encoding n = the number of bits in s;
  n bits encoding s.
So,

  0        => 0      1      0      (not used for Zipf distributions)
  1        => 0      1      1
  2        => 00     10     10
  3        => 00     10     11
  4        => 00     11     100
  256      => 0000   1001   100000000
  65535    => 00000  10000  1111111111111111
  2^32 - 1 => 000000 100000 11111111111111111111111111111111
This code is awful, but (e.g.) the efficiency for a 32 bit number is at worst 32 / ( 6 + 6 + 32 ) = 73%.
You can see that the first two columns grow like the log of the size of the third column, so this meets Henry's requirement.
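Here is a short Python rendering of that code (the helper names are mine); it reproduces the sample code words above and the worst-case 32 / ( 6 + 6 + 32 ) figure:

  def bits_in(x):
      # Number of bits in the binary representation of x (at least 1).
      return max(1, x.bit_length())

  def encode(s):
      # b zero bits, then b bits giving n = number of bits in s,
      # then the n bits of s itself.
      n = bits_in(s)
      b = bits_in(n)
      return "0" * b + format(n, "b") + format(s, "b")

  for s in (0, 1, 2, 3, 4, 256, 65535, 2**32 - 1):
      print(s, encode(s))

  print(32 / len(encode(2**32 - 1)))   # 32 / (6 + 6 + 32), about 0.73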
For a kinda large number, take the bits on my hard disk as representing a single integer. 80 GB = 80 x 2^33 bits < 2^40 bits, so this would be represented by 40 zero bits, 40 size bits, and ~80 x 2^33 data bits. 80 wasted bits in 80 GB of data, 99.9999999884% efficiency. My whole disk packed efficiently into 1/2^80, one millionth of one millionth of one millionth of one millionth of the code space.
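(For anyone who wants to check the arithmetic, in Python:)

  data_bits = 80 * 2**33                       # ~80 GB of data, as bits
  overhead = 40 + 40                           # 40 zero bits + 40 size bits
  print(data_bits / (data_bits + overhead))    # ~0.999999999884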
--Steve
I did some experiments with yet another encoding, which I called "split" encoding, which is _not_ universal. The idea is to look at the entire non-normalized Zipf distribution 1/i (i=1,n) and then find the location in the middle of this distribution where the weight on one side approximates the weight on the other. Since this distribution is approximated by the curve 1/x (1<=x<=n), and since the area under this curve from 1 to x is ln(x), we quickly see that the location in the "middle" is sqrt(n), because ln(sqrt(n)) = ln(n)/2 -- i.e., the area to the left of sqrt(n) (1<=x<=sqrt(n)) equals the area to the right of sqrt(n) (sqrt(n)<=x<=n). We now assign a high order bit of "0" for i (i=1,sqrt(n)) and a high order bit of "1" for i (i=sqrt(n),n), and recurse on the two sub-segments (1 to sqrt(n) and sqrt(n) to n). (Of course, we can do a little bit better by breaking the sequence up based on the actual harmonic sum rather than simply breaking it at sqrt(n).)

This "split" encoding isn't as good as Huffman, because the average subtree doesn't split as evenly as often. However, "split" does have the property that _the encoding preserves the ordering_ of the symbols, whereas Huffman does not. Huffman loses its ordering the moment it "bubbles up" a subtree; it achieves its efficiency by _re-ordering_ the symbols to get the most even splits. Since this re-ordering is highly dependent on the exact number of symbols, Huffman can never be "universal". The "split" encoding also demonstrates the problem with trying to come up with a "universal" encoding for the Zipf distribution: approximately half of the cdf is to the right of sqrt(n), and since sqrt(n) increases with increasing n, the encoding is necessarily non-universal.

I also instrumented Huffman on the Zipf distribution to try to better understand how evenly it splits the subtrees. Interestingly, no matter how large I made n, Huffman on Zipf of n symbols always had at least one subtree where the weights of the two branches differed by more than 10%. Of course, the efficiency of Huffman doesn't depend upon the largest imbalance, but upon the average imbalance. What is the smallest n for which the Huffman encoding of the Zipf distribution of n symbols has no imbalance exceeding 10%? I conjecture that this could be an exceedingly large number, because the penalty for such an imbalance becomes vanishingly small with increasing n.
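For anyone who wants to experiment, here is a rough Python sketch of the "split" idea as described above (the function names and the exact tie-breaking are my own, and it splits on the actual harmonic weights rather than at sqrt(n)). It builds the order-preserving prefix code by recursively dividing the symbol range where the 1/i weight is most nearly balanced, then compares the average code length with the entropy.

  from math import log2

  def zipf_weights(n):
      # Unnormalized Zipf weights 1/i for i = 1..n.
      return [1.0 / i for i in range(1, n + 1)]

  def split_code(n):
      # Order-preserving "split" code: recursively divide the symbol range
      # where the total 1/i weight is most nearly balanced.
      w = zipf_weights(n)
      codes = [""] * n

      def recurse(lo, hi, prefix):
          if lo == hi:
              codes[lo] = prefix
              return
          total = sum(w[lo:hi + 1])
          acc, best_mid, best_gap = 0.0, lo, float("inf")
          for mid in range(lo, hi):          # left half is lo..mid
              acc += w[mid]
              gap = abs(acc - total / 2)
              if gap < best_gap:
                  best_mid, best_gap = mid, gap
          recurse(lo, best_mid, prefix + "0")
          recurse(best_mid + 1, hi, prefix + "1")

      recurse(0, n - 1, "")
      return codes

  def split_efficiency(n):
      w = zipf_weights(n)
      total = sum(w)
      p = [x / total for x in w]
      codes = split_code(n)
      entropy = -sum(pi * log2(pi) for pi in p)
      avg_bits = sum(pi * len(c) for pi, c in zip(p, codes))
      return entropy / avg_bits

  for n in (16, 256, 1024):
      print(n, round(split_efficiency(n), 4))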