Dear Steve, et al:

I haven't had time to grok your whole message, but one thing popped into my mind: asinh(x). As I have discussed before in this forum, asinh(x) is a wonderful log-like function that is linear at the origin, has no (real) singularities, and is very smooth. For some reason, engineers keep reinventing log-like functions which are linear at the origin, but have various kinds of discontinuities; they keep ignoring the asinh solution to this problem.

Asinh subsumes floating point arithmetic, since it encodes the sign, the exponent & the mantissa into a single number. Asinh also subsumes "denormalized" floating point, since the "linear"/denormalized portion of the representational range is already handled by the linear portion of the asinh range.

If a 40-year anniversary update to HAKMEM is done, I think that asinh deserves an entry.

At 12:41 AM 7/19/2008, Steve Witham wrote:
On 2006.06.15, Henry Baker started a thread on binary encodings for Zipfian distributions. That is, distributions like { 1/1, 1/2, 1/3, 1/4, 1/5 ... 1/m }
He asked (at various points in the thread):
1) This must be in the literature; can anyone give a ref?
2) What efficient encodings exist?
3) Although a Zipf distribution must be finite, are there universal encodings (i.e., for infinite ranges) that approach 100% efficiency for { 1/1 .. 1/m } as m gets big?
4) Universal codes give about log2(n) bits to the number n; are there inefficient codes that give, say, sqrt(n) bits?
This letter promotes cumulative probability functions as a handle on all this.
1) References
As far as I recall, no one found any for Zipf distributions in the first round of this thread. I notice this sentence in http://en.wikipedia.org/wiki/Universal_code_(data_compression) :
"With universal codes, the implicit distribution is approximately a power law such as 1 / n^2 (more precisely, a Zipf distribution). "
2) Encoding Zipf for 1 through finite m.
Say you have a relative probability distribution f'( n ) for n = 1..m. This will go easier for both of us if f'(n) is sorted in decreasing order. Now, if there is a simple function f''(x) that approximates f'(n) (& a couple more conditions), we're in business.
e.g., for the Zipf distribution, f'(n) = 1/n; f''(x) = 1/x
One condition is that you need to be able to integrate f''(x), and also find the inverse of the resulting function pretty easily.
e.g. F''(x) = log(x) + c; G''(y) = exp(y - c)
This is the key point: f'(n) is a relative probability distribution; f''(x) is similar and almost a probability density function; F''(x) is a corresponding, almost cumulative probability function. Why settle for almost?
Let F(x) be F''(x) adjusted so that F(.5) = 0 and F( m + .5 ) = 1. Let f(x) be d F(x)/dx, and let G(y) be the inverse of F(x).
e.g., F(x) = k log( x ) + c; G(y) = exp( (y-c)/k ); f(x) = k/x
(The slight fudge here is that instead of giving n a probability proportional to f'(n) in our code, we're really going to give it the definite integral from n-.5 to n+.5 of f(x) dx. This difference approaches zero for larger numbers and can be re-fudged for smaller numbers, so we'll ignore the issue from now on.)
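(Here is a minimal Python sketch of that normalization for the Zipf example; the choice m = 1000 is arbitrary, and the names F, G, f, k, c are just the ones used above.)

  import math

  m = 1000                                          # largest value in the range (arbitrary)

  # F(x) = k*log(x) + c, scaled so that F(.5) = 0 and F(m + .5) = 1.
  k = 1.0 / (math.log(m + 0.5) - math.log(0.5))     # = 1/log(2m + 1)
  c = -k * math.log(0.5)

  def F(x):    # (almost) cumulative probability; its bits become the codeword
      return k * math.log(x) + c

  def G(y):    # inverse of F
      return math.exp((y - c) / k)

  def f(x):    # density; -log2(f(n)) approximates the codeword length of n
      return k / x

  print(F(0.5), F(m + 0.5))    # 0.0 and 1.0
  print(G(F(7)))               # recovers 7.0 (up to rounding)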
Now you can pretty much encode the value n as the binary representation of F(n), and decode codeword w with G(w). The length of the codeword for n is approximately -log2( f(n) ) (unrelated to the fact that F() is log-like in the example here).
How exactly though? Here's an existence proof of a code. Say you've got F, G and f now. To encode n, generate F(n-1) (if n>1), F( n ), and F(n+1) (if n < m) to at least a couple more bits of precision than -log2(f(n)) would call for. Use just enough bits of F(n) to distinguish from both neighbors (or the one neighbor if n = 1 or n = m).
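(Here is a rough Python sketch of that existence proof, reusing the Zipf F from the sketch above. "Just enough bits to distinguish" is implemented as the shortest truncation of F(n) that differs from the same-length truncations of F(n-1) and F(n+1); that is one reasonable reading, not the only one.)

  import math

  m = 1000
  k = 1.0 / math.log(2 * m + 1)
  c = -k * math.log(0.5)

  def F(x):
      return k * math.log(x) + c

  def bits(x, nbits):
      # First nbits bits of x (a number in [0,1)) as a 0/1 string.
      return format(int(x * (1 << nbits)), '0%db' % nbits)

  def encode(n):
      # Take just enough leading bits of F(n) to tell it apart from the
      # truncations of F(n-1) and F(n+1), where those neighbors exist.
      neighbors = [F(j) for j in (n - 1, n + 1) if 1 <= j <= m]
      b = 1
      while any(bits(F(n), b) == bits(w, b) for w in neighbors):
          b += 1
      return bits(F(n), b)

  for n in (1, 2, 3, 10, 100, 1000):
      print(n, encode(n), round(-math.log2(k / n), 1))   # codeword vs. -log2(f(n))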
I believe the code you get that way uses the whole code space. It'll be fairly efficient, but not as good as a Huffman code. I've been thinking about tricks to make it more efficient, but they would obscure the point that it's the binary representation of the cumulative probability function.
Decoding is a little trickier, since you're given a message without knowing where the first codeword ends. Nevertheless, you treat the whole message as a binary number w, take n = G(w) +/- 1, and consume about -log2(f(n)) bits after looking at the neighbors.
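(And a matching decode sketch in Python; it repeats the definitions from the encode sketch so that it runs on its own, and it settles the +/- 1 ambiguity by checking which candidate's codeword is actually a prefix of the message.)

  import math

  m = 1000
  k = 1.0 / math.log(2 * m + 1)
  c = -k * math.log(0.5)

  def F(x): return k * math.log(x) + c
  def G(y): return math.exp((y - c) / k)

  def bits(x, nbits):
      return format(int(x * (1 << nbits)), '0%db' % nbits)

  def encode(n):    # same as the encode sketch above
      neighbors = [F(j) for j in (n - 1, n + 1) if 1 <= j <= m]
      b = 1
      while any(bits(F(n), b) == bits(w, b) for w in neighbors):
          b += 1
      return bits(F(n), b)

  def decode_first(message):
      # Read the whole message as a binary fraction w = .b1 b2 b3 ...
      w = sum(int(bit) / 2.0 ** (i + 1) for i, bit in enumerate(message))
      guess = int(round(G(w)))
      # G(w) can be off by one, so look at the neighbors and see whose
      # codeword is actually a prefix of the message.
      for n in (guess - 1, guess, guess + 1):
          if 1 <= n <= m and message.startswith(encode(n)):
              return n, len(encode(n))     # decoded value, bits consumed
      raise ValueError("no codeword found")

  msg = encode(7) + encode(100) + encode(3)
  n1, used1 = decode_first(msg)
  n2, used2 = decode_first(msg[used1:])
  print(n1, n2)                            # 7 100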
3) Universal codes and their efficiency for Zipf distributions
A "universal" code is one that can encode numbers over an infinite range, e.g. any n >= 1. The whole method above works for infinite ranges if you have a distribution (a positive function whose integral is finite) over the range.
The only problem is you need unlimited-precision arithmetic. In fact, for decoding you need to increase your precision until you know you've gone far enough.
Typical "mixed" methods for universal codes break the number into an exponent and a mantissa, and use binary for the mantissa and some other method for the exponent. But with this method you just look at the values of a single function. One advantage to this is that it can be smoother and therefore more efficient at the low end.
What kind of function, though? One trick is to pick a function F() that, over the range in question (e.g. .5 to infinity) goes from zero to one, and let f() be F's derivative. One F() I was looking at way back when was:
F(x) = 1 - ( 1 / log2( x + 1.5 ) )
Using my soggy brain plus a table of derivatives I get
f(x) = d/dx F(x) = 1/( (x+1.5) (ln 2) (log2(x+1.5))^2 )
This means that the length of the codeword for n, -log2(f(n)), is about
log2(ln 2) + log2(n+1.5) + 2 log2( log2( n+1.5 ) )
In other words, a small negative constant, plus enough bits for a mantissa, plus twice enough bits for an exponent...only creamy and smooth.
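(A quick Python check of that arithmetic: it compares the length formula above with -log2 of the probability the code actually assigns to n, i.e. F(n+.5) - F(n-.5).)

  import math

  def F(x):     # the cumulative function above, on (.5, infinity)
      return 1.0 - 1.0 / math.log2(x + 1.5)

  def f(x):     # its derivative
      return 1.0 / ((x + 1.5) * math.log(2) * math.log2(x + 1.5) ** 2)

  for n in (1, 10, 1000, 10 ** 6):
      p = F(n + 0.5) - F(n - 0.5)          # probability actually given to n
      formula = (math.log2(math.log(2)) + math.log2(n + 1.5)
                 + 2 * math.log2(math.log2(n + 1.5)))
      print(n, round(-math.log2(p), 2), round(-math.log2(f(n)), 2), round(formula, 2))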
As m, the largest n in a Zipf distribution, gets large the constant and the log(log) terms become less significant, so the code based on this function approaches 100% efficiency (slowly though).
It's slightly strange that having the length of the codeword be the log of the number is the natural limit for a universal code. The same is true of the way we write decimal integers. But this corresponds to the probabilities of a Zipf distribution, which is the distribution none of us knew or found a reference to in the coding literature.
Another interesting thing to think about is that no code that gives every n a distinct finite codeword can have lengths that grow more slowly than log2(n) in the limit (there are only about 2^(L+1) binary strings of length at most L, so the lengths have to reach about log2(n) infinitely often).
I guess you could optimize for a distribution of distributions, e.g., Zipf distributions for a distribution of m.
By the way, there are known ways to generate infinite Huffman codes, although they mostly work for variations on geometric distributions.
4) Non-optimal and funky universal codes
If you want a universal code whose wordlength grows at a more-than-log rate, say like sqrt(n), then you take -log2(f''(x)) = sqrt(x), i.e. f''(x) = 2^(-sqrt(x)), integrate that to get F''(x), then adjust to get F(), f() and G().
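(A Python sketch of that recipe, normalized over (.5, infinity); the antiderivative used here is my own calculation, so check it, and the printed lengths -log2(p) track sqrt(n) up to an additive constant that comes from the normalization.)

  import math

  LN2 = math.log(2)

  def A(x):
      # An antiderivative of 2**(-sqrt(x)); differentiate to check:
      #   d/dx [ -2 * 2**(-sqrt(x)) * (sqrt(x) + 1/ln2) / ln2 ] = 2**(-sqrt(x))
      t = math.sqrt(x)
      return -2.0 * 2.0 ** (-t) * (t + 1.0 / LN2) / LN2

  total = 0.0 - A(0.5)       # integral of 2**(-sqrt(x)) over (.5, infinity)

  def F(x):                  # normalized cumulative function
      return (A(x) - A(0.5)) / total

  for n in (1, 4, 25, 100, 400):
      p = F(n + 0.5) - F(n - 0.5)              # probability given to n
      print(n, round(-math.log2(p), 1), round(math.sqrt(n), 1))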
An interesting puzzle might be to take a given binary code (e.g., unary, Zeckendorf/Fibonacci, ternary, septary, various mixed universal codes) and see whether the implied distribution resembles a simple function. I think some universal codes relate to the "log*" function, but I don't know whether a natural continuous version of log* has been found.
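(A tiny version of that puzzle in Python, using unary and Elias gamma rather than the codes named above: unary's implied probability 2**-length(n) is exactly the geometric 2**-n, and Elias gamma's stays within a small factor of 1/(2 n^2).)

  def unary_len(n):        # unary: n bits, so implied probability 2**-n
      return n

  def gamma_len(n):        # Elias gamma: 2*floor(log2 n) + 1 bits
      return 2 * (n.bit_length() - 1) + 1

  for n in (1, 2, 3, 10, 100):
      print(n,
            2.0 ** -unary_len(n), 2.0 ** -n,            # unary vs. 2**-n
            2.0 ** -gamma_len(n), 1.0 / (2 * n * n))    # gamma vs. 1/(2 n^2)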
--Steve

p.s. none of this needed to mention my *previous* pet universal tool, histograms of codeword-lengths.