[math-fun] Re: More ZipfHuffing
Hi Steve & David: I duplicated Steve's results with my own version of Huffman encoding, so I agree with his results 100%. David's suggestion is slightly inferior to pure Huffman. Huffman gives outstanding efficiency for these Zipf distributions -- e.g., 99.67% efficient for the first 16,384 elements of the series 1/i (i=1 16384). So it's a little hard to argue for the need for a more efficient encoding. The reason for this exceptional efficiency seems to be that Zipf is a very smooth distribution. Here its "long tail" actually helps, since any bad encoding of the most frequent 2 or 3 symbols won't affect the overall efficiency very much. However, even the simplest encoding (14 bits) is 71% efficient, and appears to be in the 70% range even for the first million elements of the harmonic series. Once again, the long tail helps, because the bulk of the tail is reasonbly efficiently encoded. An interesting connection between Huffman and Egyptian fractions is the inner step of Huffman: Sort the symbols into ascending order of probabilities. Take the symbols with the two smallest probabilities and "merge" them, thus adding their probabilities. Then bubble this probability up the list until it is sorted again. (This is obviously not the most efficient algorithm; I'm focussed on the probability numbers themselves.) We are obviously building up sums from Egyptian fractions (probabilities not normalized), with the topmost sum being H_n (nth harmonic number). At 12:17 AM 6/17/2006, Steve Witham wrote:
Henry,
Thanks for an interesting question. Lots of web pages mention Zipf and Huffman in the same sentence, but no one seems to have asked just what you asked before.
Dusted off and was forced to debug a Huffman-coding routine. Here it is generating codes for various populations of thingies. The digits indicate the number of bits to represent each thingy. Line 16 matches the example I gave last night; my hand-calculations were close enough.
Line 15 seems perfect. It represents an ideal of what normally (or at least sometimes) happens, but I don't think it ever happens quite that way again. But, I'm not actually sure, more exploring to do.
--Steve
1: 0 2: 11 3: 122 4: 1233 5: 12344 6: 133344 7: 1334444 8: 22334444 9: 223344455 10: 2234444455 11: 22344445555 12: 233344445555 13: 2333444555555 14: 23344444555555 15: 233444455555555 16: 2334444555555566 17: 23344445555556666 18: 233444555555556666 19: 2334445555555666666 20: 23344455555566666666 21: 233444555556666666666 22: 2334455555556666666666 23: 23344555555666666666666 24: 234444555555666666666666 25: 2344445555556666666666677 26: 23444455555666666666666677 27: 234444555556666666666667777 28: 2344455555556666666666667777 29: 23444555555566666666666777777 30: 234445555556666666666666777777 31: 2344455555566666666666677777777 32: 23444555555666666666667777777777 33: 234445555556666666666777777777777 34: 2344455555666666666666777777777777 35: 23444555556666666666677777777777777 36: 234445555566666666667777777777777777 37: 2344455555666666666777777777777777777 38: 23444555566666666666777777777777777777 39: 234445555666666666677777777777777777777 40: 2344555555666666666677777777777777777777 41: 23445555556666666666777777777777777777788 42: 234455555566666666677777777777777777777788 43: 2344555555666666666777777777777777777778888 44: 23445555566666666666777777777777777777778888 45: 234455555666666666667777777777777777777888888 46: 2344555556666666666777777777777777777777888888 47: 23445555566666666667777777777777777777788888888 48: 234455555666666666677777777777777777778888888888 49: 2344555556666666666777777777777777777888888888888 50: 23445555566666666677777777777777777777888888888888 51: 234455555666666666777777777777777777788888888888888 52: 2344555556666666667777777777777777778888888888888888 53: 23445555566666666677777777777777777888888888888888888 54: 234455555666666667777777777777777777888888888888888888 55: 2344555556666666677777777777777777788888888888888888888 56: 23445555666666666677777777777777777788888888888888888888 57: 234455556666666666777777777777777778888888888888888888888 58: 2344555566666666677777777777777777778888888888888888888888 59: 23445555666666666777777777777777777888888888888888888888888 60: 234455556666666667777777777777777788888888888888888888888888 61: 2344555566666666677777777777777778888888888888888888888888888 62: 23445555666666667777777777777777778888888888888888888888888888 63: 234455556666666677777777777777777888888888888888888888888888888 64: 2344555566666666777777777777777788888888888888888888888888888888 65: 23445555666666667777777777777777888888888888888888888888888888899 66: 234455556666666677777777777777788888888888888888888888888888888899 67: 2344555566666666777777777777777888888888888888888888888888888889999 68: 23445555666666677777777777777777888888888888888888888888888888889999 69: 234455556666666777777777777777778888888888888888888888888888888999999 70: 2344555566666667777777777777777888888888888888888888888888888888999999 71: 23445555666666677777777777777778888888888888888888888888888888899999999
participants (1)
-
Henry Baker