The entertainment so far has been that I found a solution to a problem, despite being fuzzy about the problem, by working out how to measure solutions, and having a very good solution I didn't like (unary, egads, pfeh) fall out of that. I want to say a few things about how to solve the problem that I learned after having solved it.*

The length of a codeword is supposed to match the log of the probability of the corresponding event. That way, you minimize the *expected* length of messages. But this problem wasn't about the expected length, it was about minimizing the *maximum* length. So I felt outside my usual habits. But there are two ways to push the problem back into the probability model.

One is to realize that if all messages have the *same* length, then that's the maximum. More, if that's the least possible single length, then it's also the least possible maximum. But inventing holistic codes for arrays is tricky, so I was trying to code individual numbers, which seems reasonable because the problem seems...homogeneous across the array. But thinking about the *typical* distribution seemed unwise for this minimum-maximum problem.

Second, the right analog to probability is that every distinct array has the same probability, just the inverse of the number of distinct arrays. In this case it was the number of choices, duplicates allowed, of r things (in this case deltas between entries in a sorted list) taken from n. With my example where r = 1e6 and n = 2^32,

    ( r + n - 1 )!        13,511,284
    --------------   ~=  2                           (1)
     r! ( n - 1 )!

The probability of a value for one of the elements in the array is the fraction of possible arrays where that element has that value. If each element is encoded according to that probability, the message will have the right length.

From (1) above, you can calculate the probability of a delta d. It's the ratio of the number of arrays after removing d from the problem, to the number before.
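As a sanity check on (1), here's a quick sketch in Python. The exact binomial coefficient is a 13.5-million-bit integer, so this goes through log-gamma instead; the helper name log2_comb is mine.

```python
import math

def log2_comb(N, k):
    # log2 of C(N, k) via log-gamma, since the exact value
    # would be a multi-megabit integer.
    return (math.lgamma(N + 1) - math.lgamma(k + 1)
            - math.lgamma(N - k + 1)) / math.log(2)

r = 10**6   # number of deltas in the example
n = 2**32   # range the values are drawn from

bits = log2_comb(r + n - 1, r)   # the exponent in (1)
print(f"{bits:,.0f} bits")       # close to the 13,511,284 in (1)
```

The same two quantities the text derives below fall straight out of r and n: the minimum codeword -log2(r/(r+n-1)) ~= 12.07 bits, and the unary slope -log2(n/(r+n)) ~= 1/2977 bits per unit of delta.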
The number r decreases by one for each delta, and the number n decreases by the size of the delta.

    ( r - 1 + n - 1 - d )!  r!  ( n - 1 )!
    ------------------------------------------       (2)
    ( r - 1 )!  ( n - 1 - d )!  ( r + n - 1 )!

Canceling, factoring and rearranging a little...

          r         ( r + n - 2 - d )!  ( n - 1 )!
    -------------   ------------------------------   (3)
    ( r + n - 1 )   ( r + n - 2 )!  ( n - 1 - d )!

When d = 0, the first factor determines the minimum codeword size. For my example, about -log2( 1e6 / 2^32 ) ~= 12.07, where my code used a minimum of 13 bits.

Since the cost is the log of (3), the costs of factors just add. The second factor is roughly

                          d
      (  n - 1 - d/2    )
      ( --------------- )                            (4)
      ( r + n - 2 - d/2 )

So, very roughly, the cost of d is -log2( n/(r+n) ) * d. For the example that's -log2( 2^32 / (2^32 + 1e6) ) * d ~= d/2977 (vs. my code's d/4096). This is what dictates the unary part of the code.

(The two costs turn out pretty close for the total d = n = 2^32:

    1e6 * 12.07 + 2^32 / 2977 ~= 13,512,716 **
    1e6 * 13    + 2^32 / 4096  = 14,048,576          (5)

)

What I've been calculating is the cost of the first delta in the list. But the numbers r and n tend to shrink proportionally as you go through the list, leaving the results the same--mostly & usually! But you can compare the number in (5) to the exponent of (1) as a final check.

The code is a Golomb/Rice code, which implies that deltas in a sorted list have a geometric distribution, which I'm trying to rebend my intuition around.

--Steve

* "You can't fix a Lisp Machine by just power-cycling it with no understanding of what's going on!"

** Using a better estimate of the cost of the average first delta,

    -r*log2(r/(r+n-1)) - n*log2((n-1-(n/r/2))/(n+r-2-(n/r/2))) ~= 13,511,294

-- off by only 10 bits.
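For concreteness, a Rice code with divisor 2^12 = 4096 can be sketched in a few lines. The divisor matches the 13-bit minimum and the d/4096 unary slope quoted above; the exact bit layout (ones-terminated-by-zero quotient, then the remainder) and the names rice_encode/rice_decode are my assumptions, not necessarily the code from the earlier posts.

```python
def rice_encode(d, k=12):
    # Unary quotient (q ones then a zero), followed by the k-bit
    # remainder. Length = d // 2**k + 1 + k: a 13-bit minimum for
    # k=12, plus one extra bit per 4096 of delta.
    q, rem = d >> k, d & ((1 << k) - 1)
    return "1" * q + "0" + format(rem, f"0{k}b")

def rice_decode(bits, k=12):
    # Inverse: count leading ones, then read the k remainder bits.
    q = bits.index("0")
    return (q << k) + int(bits[q + 1:q + 1 + k], 2)

print(len(rice_encode(0)))        # 13 -- the minimum codeword
print(len(rice_encode(8192)))     # 15 -- 13 + 8192/4096
print(rice_decode(rice_encode(12345)))   # 12345
```

The geometric-distribution remark is visible right in the lengths: each codeword costs 13 bits plus d/4096, i.e. the code is optimal exactly when each extra 4096 of delta halves the probability.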