You can improve this using Fibonacci (or Zeckendorf) encoding - every
codeword ends in 11. So we have:

1   11      2 bits
2   011     3 bits
3   0011    4 bits
4   1011    4 bits
5   00011   5 bits
6   10011   5 bits
7   01011   5 bits
8   000011  6 bits
etc.

Note, by the way, that changing the radix as the number of symbols
increases is *not* using the same encoding regardless of the number of
symbols.

Franklin T. Adams-Watters

-----Original Message-----
From: Henry Baker <hbaker1@pipeline.com>

Franklin, Steve, David, et al:

After a little thinking, I reinvented ternary encoding (or, more
generally, arbitrary radix encoding for radices 2^k-1). The idea is
similar to ordinary decimal notation, except that we use a radix of 3
or 7 or 15 or ... The reason for not using binary encoding is that we
have to encode a "comma" or something to delimit the end of the
integer, so we use up one of the "digits" for this purpose. The larger
the radix, the smaller the coding loss.

Here are ternary encodings for small integers:

1   1_    4 bits (2 bits per character)
2   2_    4 bits
3   10_   6 bits
4   11_   6 bits
5   12_   6 bits
6   20_   6 bits
7   21_   6 bits
8   22_   6 bits
9   100_  8 bits
...

Actually, since the first digit is always 1 or 2, we can subtract 1 bit
from the (ternary) encoding without any loss. The calculation below did
not make this optimization.

For a Zipf alphabet of 2^20 symbols, this ternary encoding scheme
achieves an efficiency of 88.4%; a radix-7 (septenary) scheme achieves
97%. Once again, the smoothness of the Zipf distribution allows us to
ignore some of the inefficiencies for the first few symbols.

At 02:48 PM 6/20/2006, franktaw@netscape.net wrote:
I'm not sure exactly how you define efficiency, but I'm pretty sure
that this is impossible. If the first k values use half the available
code space, then as n goes to infinity that half of the code space is
used for values carrying an asymptotically vanishing fraction,
O(1/log(n)), of the probability distribution. I don't see how this can
be anything but inefficient.
Franklin T. Adams-Watters
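Henry Baker's "comma" code quoted further down in the thread can be
sketched as follows (a minimal Python sketch under assumed details:
digits 0..radix-1 plus a terminator written "_", with each character
costing ceil(log2(radix+1)) bits; `comma_encode` is a name invented
here):

```python
import math

def comma_encode(n, radix=3):
    """Encode a positive integer in base `radix` and append a comma.

    With radix 3 the alphabet is {0, 1, 2, _}, so each character
    packs into 2 bits; radix 7 uses 3 bits per character, and so on.
    """
    assert n >= 1
    digits = []
    while n > 0:
        digits.append(str(n % radix))
        n //= radix
    return "".join(reversed(digits)) + "_"

def comma_code_bits(n, radix=3):
    """Total bit cost of the codeword for n (characters times bits/char)."""
    bits_per_char = math.ceil(math.log2(radix + 1))
    return len(comma_encode(n, radix)) * bits_per_char
```

For example, `comma_encode(9)` gives `"100_"`, four characters at 2
bits each, matching the 8 bits in the table above.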
-----Original Message----- From: Henry Baker <hbaker1@pipeline.com>
Is there a better encoding of Zipf alphabets such that the encoding of the symbol of relative probability 1/i is _independent_ of the alphabet size and the encoding tends towards 100% efficiency asymptotically?
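The Fibonacci (Zeckendorf) encoding proposed at the top of the thread
can be sketched like this (a minimal Python sketch; `fib_encode` is a
name invented here, and the bit convention - least significant
Fibonacci number first, with a final 1 appended so every codeword ends
in "11" - follows the table above):

```python
def fib_encode(n):
    """Fibonacci (Zeckendorf) code for a positive integer n.

    Writes n as a sum of non-consecutive Fibonacci numbers (greedy,
    largest first), emits one bit per Fibonacci number with the least
    significant first, then appends a final 1 so the codeword ends in 11.
    """
    assert n >= 1
    # Fibonacci numbers F(2)=1, F(3)=2, ... not exceeding n.
    fibs = [1, 2]
    while fibs[-1] < n:
        fibs.append(fibs[-1] + fibs[-2])
    if fibs[-1] > n:
        fibs.pop()
    # Greedy subtraction from the largest Fibonacci number downward;
    # this automatically never selects two consecutive Fibonacci numbers.
    bits = []
    for f in reversed(fibs):
        if f <= n:
            bits.append("1")
            n -= f
        else:
            bits.append("0")
    bits.reverse()  # least significant Fibonacci number first
    return "".join(bits) + "1"
```

For example, `fib_encode(4)` returns `"1011"` (4 = 1 + 3), matching the
4-bit entry in the table, and no codeword is a prefix of another
because "11" occurs only at the end.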