Dear Steve, et al:

I haven't had time to grok your whole message, but one thing popped into my mind: asinh(x). As I have discussed before in this forum, asinh(x) is a wonderful log-like function that is linear at the origin, has no (real) singularities, and is very smooth. For some reason, engineers keep reinventing log-like functions which are linear at the origin, but have various kinds of discontinuities; they keep ignoring the asinh solution to this problem.

Asinh subsumes floating point arithmetic, since it encodes the sign, the exponent & the mantissa into a single number. Asinh also subsumes "denormalized" floating point, since the "linear"/denormalized portion of the representational range is already handled by the linear portion of the asinh range.

If a 40-year anniversary update to HAKMEM is done, I think that asinh deserves an entry.

At 12:41 AM 7/19/2008, Steve Witham wrote:
On 2006.06.15, Henry Baker started a thread on binary encodings for Zipfian distributions. That is, distributions like { 1/1, 1/2, 1/3, 1/4, 1/5 ... 1/m }
He asked (at various points in the thread):
1) This must be in the literature; can anyone give a ref?
2) What efficient encodings exist?
3) Although a Zipf distribution must be finite, are there universal encodings (i.e., for infinite ranges) that approach 100% efficiency for { 1/1 .. 1/m } as m gets big?
4) Universal codes give about log2(n) bits to the number n; are there inefficient codes that give, say, sqrt(n) bits?
This letter promotes cumulative probability functions as a handle on all this.
1) References
As far as I recall, no one found any for Zipf distributions in the first round of this thread. I notice this sentence in http://en.wikipedia.org/wiki/Universal_code_(data_compression) :
"With universal codes, the implicit distribution is approximately a power law such as 1 / n^2 (more precisely, a Zipf distribution). "
2) Encoding Zipf for 1 through finite m.
Say you have a relative probability distribution f'( n ) for n = 1..m. This will go easier for both of us if f'(n) is sorted in decreasing order. Now, if there is a simple function f''(x) that approximates f'(n) (& a couple more conditions), we're in business.
e.g., for the Zipf distribution, f'(n) = 1/n; f''(x) = 1/x
One condition is that you need to be able to integrate f''(x), and also find the inverse of the resulting function pretty easily.
e.g. F''(x) = log(x) + c; G''(y) = exp(y - c)
This is the key point: f'(n) is a relative probability distribution; f''(x) is similar and almost a probability density function; F''(x) is a corresponding, almost cumulative probability function. Why settle for almost?
Let F(x) be F''(x) adjusted so that F(.5) = 0 and F( m + .5 ) = 1. Let f(x) be d F(x)/dx, and let G(y) be the inverse of F(x).
e.g., F(x) = k log( x ) + c; G(y) = exp( (y-c)/k ); f(x) = k/x
(The slight fudge here is that instead of giving n a probability proportional to f'(n) in our code, we're really going to give it the definite integral from n-.5 to n+.5 of f(x) dx. This difference approaches zero for larger numbers and can be re-fudged for smaller numbers, so we'll ignore the issue from now on.)
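(Here is a minimal Python sketch of that normalization for the Zipf example; the choice m = 1000 is arbitrary, and the names F, G, f, k, c are just the ones used above.)

  import math

  m = 1000                                          # largest value in the range (arbitrary)

  # F(x) = k*log(x) + c, scaled so that F(.5) = 0 and F(m + .5) = 1.
  k = 1.0 / (math.log(m + 0.5) - math.log(0.5))     # = 1/log(2m + 1)
  c = -k * math.log(0.5)

  def F(x):    # (almost) cumulative probability; its bits become the codeword
      return k * math.log(x) + c

  def G(y):    # inverse of F
      return math.exp((y - c) / k)

  def f(x):    # density; -log2(f(n)) approximates the codeword length of n
      return k / x

  print(F(0.5), F(m + 0.5))    # 0.0 and 1.0
  print(G(F(7)))               # recovers 7.0 (up to rounding)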
Now you can pretty much encode the value n as the binary representation of F(n), and decode codeword w with G(w). The length of the codeword for n is approximately -log2( f(n) ) (unrelated to the fact that F() is log-like in the example here).
How exactly though? Here's an existence proof of a code. Say you've got F, G and f now. To encode n, generate F(n-1) (if n>1), F( n ), and F(n+1) (if n < m) to at least a couple more bits of precision than -log2(f(n)) would call for. Use just enough bits of F(n) to distinguish from both neighbors (or the one neighbor if n = 1 or n = m).
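(Here is a rough Python sketch of that existence proof, reusing the Zipf F from the sketch above. "Just enough bits to distinguish" is implemented as the shortest truncation of F(n) that differs from the same-length truncations of F(n-1) and F(n+1); that is one reasonable reading, not the only one.)

  import math

  m = 1000
  k = 1.0 / math.log(2 * m + 1)
  c = -k * math.log(0.5)

  def F(x):
      return k * math.log(x) + c

  def bits(x, nbits):
      # First nbits bits of x (a number in [0,1)) as a 0/1 string.
      return format(int(x * (1 << nbits)), '0%db' % nbits)

  def encode(n):
      # Take just enough leading bits of F(n) to tell it apart from the
      # truncations of F(n-1) and F(n+1), where those neighbors exist.
      neighbors = [F(j) for j in (n - 1, n + 1) if 1 <= j <= m]
      b = 1
      while any(bits(F(n), b) == bits(w, b) for w in neighbors):
          b += 1
      return bits(F(n), b)

  for n in (1, 2, 3, 10, 100, 1000):
      print(n, encode(n), round(-math.log2(k / n), 1))   # codeword vs. -log2(f(n))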
I believe the code you get that way uses the whole code space. It'll be fairly efficient, but not as good as a Huffman code. I've been thinking about tricks to make it more efficient, but they would obscure the point that it's the binary representation of the cumulative probability function.
Decoding is a little trickier, since you're given a message without knowing where the first codeword ends. Nevertheless, you treat the whole message as a binary number w, take n = G(w) +/- 1, and consume about -log2(f(n)) bits after looking at the neighbors.
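(And a matching decode sketch in Python; it repeats the definitions from the encode sketch so that it runs on its own, and it settles the +/- 1 ambiguity by checking which candidate's codeword is actually a prefix of the message.)

  import math

  m = 1000
  k = 1.0 / math.log(2 * m + 1)
  c = -k * math.log(0.5)

  def F(x): return k * math.log(x) + c
  def G(y): return math.exp((y - c) / k)

  def bits(x, nbits):
      return format(int(x * (1 << nbits)), '0%db' % nbits)

  def encode(n):    # same as the encode sketch above
      neighbors = [F(j) for j in (n - 1, n + 1) if 1 <= j <= m]
      b = 1
      while any(bits(F(n), b) == bits(w, b) for w in neighbors):
          b += 1
      return bits(F(n), b)

  def decode_first(message):
      # Read the whole message as a binary fraction w = .b1 b2 b3 ...
      w = sum(int(bit) / 2.0 ** (i + 1) for i, bit in enumerate(message))
      guess = int(round(G(w)))
      # G(w) can be off by one, so look at the neighbors and see whose
      # codeword is actually a prefix of the message.
      for n in (guess - 1, guess, guess + 1):
          if 1 <= n <= m and message.startswith(encode(n)):
              return n, len(encode(n))     # decoded value, bits consumed
      raise ValueError("no codeword found")

  msg = encode(7) + encode(100) + encode(3)
  n1, used1 = decode_first(msg)
  n2, used2 = decode_first(msg[used1:])
  print(n1, n2)                            # 7 100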
3) Universal codes and their efficiency for Zipf distributions
A "universal" code is one that can encode numbers over an infinite range, e.g. any n >= 1. The whole method above works for infinite ranges if you have a distribution (a positive function whose integral is finite) over the range.
The only problem is you need unlimited-precision arithmetic. In fact, for decoding you need to increase your precision until you know you've gone far enough.
Typical "mixed" methods for universal codes break the number into an exponent and a mantissa, and use binary for the mantissa and some other method for the exponent. But with this method you just look at the values of a single function. One advantage to this is that it can be smoother and therefore more efficient at the low end.
What kind of function, though? One trick is to pick a function F() that, over the range in question (e.g. .5 to infinity) goes from zero to one, and let f() be F's derivative. One F() I was looking at way back when was:
F(x) = 1 - ( 1 / log2( x + 1.5 ) )
Using my soggy brain plus a table of derivatives I get
f(x) = d/dx F(x) = 1/( (x+1.5) (ln 2) (log2(x+1.5))^2 )
This means that the length of the codeword for n, -log2(f(n)), is about
log2(ln 2) + log2(n+1.5) + 2 log2( log2( n+1.5 ) )
In other words, a small negative constant, plus enough bits for a mantissa, plus twice enough bits for an exponent...only creamy and smooth.
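(A quick Python check of that arithmetic: it compares the length formula above with -log2 of the probability the code actually assigns to n, i.e. F(n+.5) - F(n-.5).)

  import math

  def F(x):     # the cumulative function above, on (.5, infinity)
      return 1.0 - 1.0 / math.log2(x + 1.5)

  def f(x):     # its derivative
      return 1.0 / ((x + 1.5) * math.log(2) * math.log2(x + 1.5) ** 2)

  for n in (1, 10, 1000, 10 ** 6):
      p = F(n + 0.5) - F(n - 0.5)          # probability actually given to n
      formula = (math.log2(math.log(2)) + math.log2(n + 1.5)
                 + 2 * math.log2(math.log2(n + 1.5)))
      print(n, round(-math.log2(p), 2), round(-math.log2(f(n)), 2), round(formula, 2))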
As m, the largest n in a Zipf distribution, gets large the constant and the log(log) terms become less significant, so the code based on this function approaches 100% efficiency (slowly though).
It's slightly strange that having the length of the codeword be the log of the number is the natural limit for a universal code. The same is true of the way we write decimal integers. But this corresponds to the probabilities of a Zipf distribution, which is the distribution none of us knew or found a reference to in the coding literature.
Another interesting thing to think about is that no code that gives every n a distinct finite codeword can have lengths that grow more slowly than log2(n) in the limit (there are only about 2^(L+1) binary strings of length at most L, so the lengths have to reach about log2(n) infinitely often).
I guess you could optimize for a distribution of distributions, e.g., Zipf distributions for a distribution of m.
By the way, there are known ways to generate infinite Huffman codes, although they mostly work for variations on geometric distributions.
4) Non-optimal and funky universal codes
If you want a universal code whose wordlength grows at a more-than-log rate, say like sqrt(n), then you take -log2(f''(x)) = sqrt(x), i.e. f''(x) = 2^(-sqrt(x)), integrate that to get F''(x), then adjust to get F(), f() and G().
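(A Python sketch of that recipe, normalized over (.5, infinity); the antiderivative used here is my own calculation, so check it, and the printed lengths -log2(p) track sqrt(n) up to an additive constant that comes from the normalization.)

  import math

  LN2 = math.log(2)

  def A(x):
      # An antiderivative of 2**(-sqrt(x)); differentiate to check:
      #   d/dx [ -2 * 2**(-sqrt(x)) * (sqrt(x) + 1/ln2) / ln2 ] = 2**(-sqrt(x))
      t = math.sqrt(x)
      return -2.0 * 2.0 ** (-t) * (t + 1.0 / LN2) / LN2

  total = 0.0 - A(0.5)       # integral of 2**(-sqrt(x)) over (.5, infinity)

  def F(x):                  # normalized cumulative function
      return (A(x) - A(0.5)) / total

  for n in (1, 4, 25, 100, 400):
      p = F(n + 0.5) - F(n - 0.5)              # probability given to n
      print(n, round(-math.log2(p), 1), round(math.sqrt(n), 1))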
An interesting puzzle might be to take a given binary code (e.g., unary, Zeckendorf/Fibonacci, ternary, septary, various mixed universal codes) and see whether the implied distribution resembles a simple function. I think some universal codes relate to the "log*" function, but I don't know whether a natural continuous version of log* has been found.
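(A tiny version of that puzzle in Python, using unary and Elias gamma rather than the codes named above: unary's implied probability 2**-length(n) is exactly the geometric 2**-n, and Elias gamma's stays within a small factor of 1/(2 n^2).)

  def unary_len(n):        # unary: n bits, so implied probability 2**-n
      return n

  def gamma_len(n):        # Elias gamma: 2*floor(log2 n) + 1 bits
      return 2 * (n.bit_length() - 1) + 1

  for n in (1, 2, 3, 10, 100):
      print(n,
            2.0 ** -unary_len(n), 2.0 ** -n,            # unary vs. 2**-n
            2.0 ** -gamma_len(n), 1.0 / (2 * n * n))    # gamma vs. 1/(2 n^2)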
--Steve

p.s. none of this needed to mention my *previous* pet universal tool, histograms of codeword-lengths.