[math-fun] Zipf, Harmonic series & Egyptian fractions
Zipf's Law says that the probability of the n'th most popular thingy is proportional to 1/n. Clearly this probability exists only if the universe of thingy's is finite. I'm interested in efficient encodings of thingy's. The Shannon-Fano code says to split the thingy's into two groups having approximately equal probabilities, assign one group an initial bit "0" and the other group an initial bit "1", and then recurse within the groups. This means to me that one must express the fraction 1/2 as a sum of probabilities, each of which is proportional to 1/k for some integer k. Is there a general way to do this for arbitrary m, where m is the number of thingy's? I.e., is there a general scheme for splitting the infinite set {1/k} into two pieces that gives an optimal encoding for every m? If such a general scheme isn't exactly optimal, what about a scheme that is within epsilon of being optimal? I presume that this question has already been studied, so I'm primarily looking for a reference. Thanks in advance for any help on this. Henry Baker hbaker1@pipeline.com
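A minimal sketch of the question (my own illustration, not part of the post above): for a fixed m, the first Shannon-Fano bit splits the sorted weights 1/1, 1/2, ..., 1/m at the prefix whose mass comes closest to half the harmonic number H_m.

```python
# Illustration only: the first Shannon-Fano split on the truncated
# Zipf weights 1/1, 1/2, ..., 1/m.  Because the weights are sorted in
# decreasing order, the split is a prefix {1, ..., j}; we pick the j
# whose prefix mass is closest to H_m / 2.
from fractions import Fraction

def shannon_fano_split(m):
    """Return (j, error): best prefix length and its distance from H_m / 2."""
    weights = [Fraction(1, k) for k in range(1, m + 1)]
    half = sum(weights) / 2          # H_m / 2, exactly
    best_j, best_err, running = 0, half, Fraction(0)
    for j in range(1, m + 1):
        running += weights[j - 1]
        err = abs(running - half)
        if err < best_err:
            best_j, best_err = j, err
    return best_j, best_err

# For m = 10: H_10 / 2 = 7381/5040, and the prefix {1/1, 1/2} with mass
# 3/2 is the closest split, so the first bit separates {1, 2} from {3..10}.
```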
On 6/15/06, Henry Baker <hbaker1@pipeline.com> wrote:
> Zipf's Law says that the probability of the n'th most popular thingy is proportional to 1/n. Clearly this probability exists only if the universe of thingy's is finite.
> I'm interested in efficient encodings of thingy's. The Shannon-Fano code says to split the thingy's into two groups having approximately equal probabilities, assign one group an initial bit "0" and the other group an initial bit "1", and then recurse within the groups. This means to me that one must express the fraction 1/2 as a sum of probabilities, each of which is proportional to 1/k for some integer k.
> Is there a general way to do this for arbitrary m, where m is the number of thingy's? I.e., is there a general scheme for splitting the infinite set {1/k} into two pieces that gives an optimal encoding for every m? If such a general scheme isn't exactly optimal, what about a scheme that is within epsilon of being optimal?
Don't you also want to optimise the assignments within each half, and so on recursively down to the leaves of the (search) tree; in which case an appropriate weight function is required to combine the discrepancies at each node? Dunno whether maybe Huffman coding might be at all relevant to this problem --- but you've presumably considered that possibility. Oh, and shouldn't it be "thingies"? Fred Lunnon
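Huffman coding does apply directly here: for any fixed finite m it builds an optimal prefix code from the raw weights 1/1, ..., 1/m (no normalization needed, since scaling all weights doesn't change the tree). A minimal sketch, with identifier names of my own choosing:

```python
# Illustration only: Huffman code lengths for the Zipf weights 1/k.
import heapq
from fractions import Fraction

def huffman_code_lengths(weights):
    """Return the Huffman code length of each weight (optimal prefix code)."""
    # Heap entries: (subtree weight, tiebreak id, leaf indices in subtree).
    heap = [(w, i, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    counter = len(weights)
    while len(heap) > 1:
        w1, _, leaves1 = heapq.heappop(heap)
        w2, _, leaves2 = heapq.heappop(heap)
        for i in leaves1 + leaves2:
            lengths[i] += 1   # each merge adds one bit to every leaf below it
        heapq.heappush(heap, (w1 + w2, counter, leaves1 + leaves2))
        counter += 1
    return lengths

# Zipf weights 1/1 .. 1/8; the resulting lengths satisfy Kraft's
# inequality with equality, as any Huffman code must.
lengths = huffman_code_lengths([Fraction(1, k) for k in range(1, 9)])
```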
Here's a quick approach. Each block of 2^k consecutive numbers {2^k, ..., 2^(k+1)-1} has approximately the same reciprocal sum, about ln 2 (it's probably at least as good an approximation as Zipf's Law is to start with). So break the integers into groups based on floor(log_2(n)), and iterate up the binary digits of that value.

E.g., the first division will have groups:

1, 4, 5, 6, 7, 16, 17, 18, ...
2, 3, 8, 9, 10, 11, 12, 13, 14, 15, 32, 33, 34, ...

The second iteration divides the first group into:

1, 16, 17, 18, ...
4, 5, 6, 7, 64, 65, 66, ...

and the second into:

2, 3, 32, 33, 34, ...
8, 9, 10, 11, 12, 13, 14, 15, 128, 129, 130, ...

etc. When you finish this process (depending on how many thingies you have), you will have a group of 2^k thingies for some k, all of them approximately equally likely.

In practice, some optimizations may be called for. If there are fewer than about (3/2)·2^k thingies, you should probably put the excess into the smallest group instead of making a group by themselves.

Franklin T. Adams-Watters
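A sketch of this scheme as I read it (function and variable names are my own): the code prefix assigned to n is built from the low-order bits of floor(log_2(n)), least-significant bit first, so each block {2^e, ..., 2^(e+1)-1} lands in one group per iteration.

```python
# Illustration only: group integers by the binary digits of floor(log2(n)),
# least-significant bit first, as in the scheme described above.

def group_bits(n, depth):
    """Return the first `depth` code bits for n >= 1: the low `depth`
    bits of floor(log2(n)), least-significant bit first."""
    e = n.bit_length() - 1          # floor(log2(n)) for n >= 1
    return [(e >> b) & 1 for b in range(depth)]

# First division of 1..20 (matches the listing above):
group0 = [n for n in range(1, 21) if group_bits(n, 1) == [0]]
group1 = [n for n in range(1, 21) if group_bits(n, 1) == [1]]
# group0 -> [1, 4, 5, 6, 7, 16, 17, 18, 19, 20]
# group1 -> [2, 3, 8, 9, 10, 11, 12, 13, 14, 15]
```

Each block {2^e, ..., 2^(e+1)-1} has reciprocal sum close to ln 2, which is why groups holding the same number of blocks carry roughly equal probability mass.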
_______________________________________________
math-fun mailing list
math-fun@mailman.xmission.com
http://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
participants (3)
- franktaw@netscape.net
- Fred Lunnon
- Henry Baker