Re: [math-fun] Encoding sorted lists with duplicates
Okay, encoding the differences between successive elements (deltas) of a sorted list of r integers in { 0 .. n-1 }. This came from a programming puzzle I saw floating around the net: sort 1 million 32-bit ints in 2 million bytes of memory. Stored raw, that's 32 million bits = 4 million bytes, but for r = 1,000,000 and n = 2^32 a sorted list ought to be representable in

  log2( (r + n - 1)! / ( r! (n - 1)! ) ) ~= 13,511,284 bits = 1,688,910.5 bytes.

The key is that the total of the deltas is less than 2^32. Once I started thinking about limiting how fast you use up bits vs. how fast you close in on that total--as a ratio to apply to the individual codewords--I was driven against my will to encode a delta d as floor( d / 2^12 ) one bits, a zero bit, and twelve binary bits (the low twelve bits of d). With this mixture of unary and binary,

  sizeof(d) <= 13 + d / 2^12,

which makes it easy to see that for a million deltas totalling less than 2^32,

  sum( sizeof(d) ) < 13,000,000 + 2^32 / 2^12 = 13,000,000 + 2^20 = 14,048,576 bits.

That's about 6/10 of a bit more per codeword than the best possible (roughly 537,000 extra bits spread over a million codewords). Just the fact that it uses a whole number of bits for each codeword accounts for half a bit.

I don't know whether this is an old trick; I haven't found another solution to the problem (as I understand it) on the web.
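Here is a minimal sketch in C of the delta code described above. The bit-buffer plumbing, buffer size, and function names are illustrative choices made for this example, not anyone's actual solution; only the rule itself--floor( d / 2^12 ) one bits, a zero bit, then the twelve low bits of d--comes from the description above.

/* Sketch of the unary-plus-12-bit delta code described above.
   Bit buffer, sizes, and names are illustrative, not a real solution. */

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define K 12                      /* number of low bits kept in binary */

typedef struct { uint8_t *buf; size_t pos; } BitStream;   /* pos in bits */

static void put_bit(BitStream *s, int b) {
    if (b) s->buf[s->pos >> 3] |= (uint8_t)(1u << (7 - (s->pos & 7)));
    s->pos++;
}

static int get_bit(BitStream *s) {
    int b = (s->buf[s->pos >> 3] >> (7 - (s->pos & 7))) & 1;
    s->pos++;
    return b;
}

/* floor(d / 2^K) one bits, a zero bit, then the K low-order bits of d. */
static void encode_delta(BitStream *s, uint32_t d) {
    for (uint32_t q = d >> K; q > 0; q--) put_bit(s, 1);
    put_bit(s, 0);
    for (int i = K - 1; i >= 0; i--) put_bit(s, (d >> i) & 1);
}

static uint32_t decode_delta(BitStream *s) {
    uint32_t q = 0, r = 0;
    while (get_bit(s)) q++;                      /* unary quotient       */
    for (int i = 0; i < K; i++) r = (r << 1) | (uint32_t)get_bit(s);
    return (q << K) | r;                         /* q * 2^K + low bits   */
}

int main(void) {
    uint32_t deltas[] = { 0, 1, 4095, 4096, 70000, 1u << 20 };
    size_t n = sizeof deltas / sizeof deltas[0];

    uint8_t *mem = calloc(1 << 16, 1);          /* plenty for this demo */
    if (!mem) return 1;

    BitStream w = { mem, 0 };
    for (size_t i = 0; i < n; i++) encode_delta(&w, deltas[i]);

    BitStream r = { mem, 0 };
    for (size_t i = 0; i < n; i++)
        printf("%u -> %u\n", (unsigned)deltas[i], (unsigned)decode_delta(&r));

    printf("total bits used: %zu\n", w.pos);    /* <= sum of 13 + d/2^K */
    free(mem);
    return 0;
}

Changing K trades the unary part against the fixed binary part: the worst case is r * (K + 1) + 2^32 / 2^K bits, which for r = 1,000,000 comes to 14,097,152 bits at K = 11 and 14,524,288 bits at K = 13, so K = 12 is the integer that minimizes the bound.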
--Steve