From: Henry Baker <hbaker1@pipeline.com>
Is there a better encoding of Zipf alphabets such that the encoding of the symbol of relative probability 1/i is _independent_ of the alphabet size and the encoding tends towards 100% efficiency asymptotically?
Henry and others have shown a couple codes that meet this requirement.
From: franktaw@netscape.net
I'm not sure exactly how you define efficiency,
You divide the entropy by the, er, weighted bits per symbol of the code. If P(s) is the probability of symbol s, then the entropy (in bits) of symbol s is -log2(P(s)), and the entropy for the whole alphabet is the sum of -log2(P(s)) * P(s) over the whole alphabet. If the code uses bits(s) bits to encode symbol s, then the weighted bits per symbol for the alphabet is the sum of bits(s) * P(s) over the whole alphabet. The efficiency of the code is the entropy divided by that weighted (average) bits per symbol. (Henry, that is the way you're calculating it, right?)
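Here is that calculation as a short Python sketch (the function names and the 8-bit fixed-length example code are mine, just for illustration):

    import math

    def zipf_probs(n):
        # P(s) proportional to 1/s for s = 1..n, normalized by the harmonic number H_n.
        h = sum(1.0 / i for i in range(1, n + 1))
        return [1.0 / (i * h) for i in range(1, n + 1)]

    def efficiency(probs, bits):
        # Entropy of the alphabet divided by the weighted (average) bits per symbol.
        entropy = sum(-p * math.log2(p) for p in probs)
        avg_bits = sum(p * b for p, b in zip(probs, bits))
        return entropy / avg_bits

    # Example: a plain 8-bit fixed-length code for a 256-symbol Zipf alphabet.
    # The entropy works out to roughly 6.2 bits per symbol, so this prints about 0.78.
    print(efficiency(zipf_probs(256), [8] * 256))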
but I'm pretty sure that this is impossible. If the first k values use half the available code space, then as n goes to infinity this half of the code space will be used for values carrying an asymptotically vanishing share (O(1/log(n))) of the probability distribution. I don't see how this can be anything but inefficient.
I wanted to point out where this intuition went wrong. The code must be inefficient, but the inefficiency can approach zero for larger and larger alphabets, as a ratio of the size of the code word to what it would ideally be.

"Half of the code space" is the misleading idea. Half of the code space means all the words that start with (say) a one bit. So if half the code space is wasted, that means there is one extra bit on every word. If only we could waste just half the code space! In fact, as you use larger and larger alphabets, more and more code space is wasted on small numbers with less and less probability. But wasted bits are just the log of the wasted space. For the efficiency to approach 100%, the wasted bits just have to grow more slowly than the ideal size of the word. That's true if they grow like log2(size of the word), which is like log2(log2(s)). This is true of a couple codes that have been discussed here.

A code where this property is easy to see encodes each number s by:

    b zero bits, where b is the number of bits in the number of bits in s
    b bits encoding n = number of bits in s
    n bits encoding s

So,

    0        => 0      1      0    (not used for Zipf distributions)
    1        => 0      1      1
    2        => 00     10     10
    3        => 00     10     11
    4        => 00     11     100
    256      => 0000   1001   100000000
    65535    => 00000  10000  1111111111111111
    2^32 - 1 => 000000 100000 11111111111111111111111111111111

This code is awful, but (e.g.) the efficiency for a 32-bit number is at worst 32 / (6 + 6 + 32) = 73%. You can see that the first two columns grow like the log of the size of the third column, so this meets Henry's requirement.

For a kinda large number, take the bits on my hard disk as representing a single integer. 80 GB = 80 x 2^33 bits < 2^40 bits, so this would be represented by 40 zero bits, 40 size bits, and ~80 x 2^33 data bits. That is 80 wasted bits in 80 GB of data, or 99.9999999884% efficiency. My whole disk packed efficiently into 1/2^80, one millionth of one millionth of one millionth of one millionth of the code space.

--Steve
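For concreteness, here is a short Python sketch of the encoder described above (the function names are mine, and 0 is treated as a 1-bit number, matching the table):

    def encode(s):
        # b zero bits, then n (the bit length of s) written in b bits,
        # then s itself written in n bits, where b is the bit length of n.
        n = max(s.bit_length(), 1)   # treat 0 as a 1-bit number, as in the table
        b = n.bit_length()
        return "0" * b + format(n, "b") + format(s, "b").zfill(n)

    def worst_case_efficiency(s):
        # Ideal length (the bare binary length of s) over the actual code length.
        n = max(s.bit_length(), 1)
        return n / len(encode(s))

    # Reproduces the table above, e.g. encode(4) == "0011100" ("00 11 100").
    for s in (0, 1, 2, 3, 4, 256, 65535, 2**32 - 1):
        print(s, encode(s))
    print(worst_case_efficiency(2**32 - 1))   # 32 / 44, the "at worst 73%" figure

Decoding works because the run of b zeros is always terminated by the leading 1 bit of n, so each code word is self-delimiting.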