On 6/15/06, Henry Baker <hbaker1@pipeline.com> wrote:
Zipf's Law says that the probability of the n'th most popular thingy is proportional to 1/n. Clearly this probability exists only if the universe of thingy's is finite, since the harmonic series sum 1/n diverges and the probabilities could not otherwise be normalized.
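A quick sketch of that normalization: for a finite universe of m thingy's, the n'th probability is (1/n)/H_m, where H_m is the m'th harmonic number (the function name below is just illustrative).

    from fractions import Fraction

    def zipf(m):
        """Zipf probabilities p(n) = (1/n)/H_m for n = 1..m."""
        h_m = sum(Fraction(1, n) for n in range(1, m + 1))
        return [Fraction(1, n) / h_m for n in range(1, m + 1)]

    p = zipf(5)      # H_5 = 137/60, so p = [60/137, 30/137, 20/137, 15/137, 12/137]
    assert sum(p) == 1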
I'm interested in efficient encodings of thingy's. The Shannon-Fano code says to split the thingy's into two groups having approximately equal total probabilities, assign one group an initial bit "0" and the other group an initial bit "1", and then recurse within the groups. This means to me that one must express the fraction 1/2 as a sum of probabilities, each of which is proportional to 1/k for some integer k.
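A minimal sketch of that recursion (assuming the probabilities are given in decreasing order; the split point minimizes the difference between the two groups' masses, and the names are illustrative):

    def shannon_fano(probs, lo=0, hi=None, prefix=""):
        """Return {index: codeword} for probs[lo:hi], assumed sorted decreasing,
        by splitting into two near-equal-mass groups and recursing."""
        if hi is None:
            hi = len(probs)
        if hi - lo == 1:
            return {lo: prefix or "0"}
        total = sum(probs[lo:hi])
        run, best_gap, split = 0.0, None, lo + 1
        for k in range(lo + 1, hi):
            run += probs[k - 1]              # mass of probs[lo:k]
            gap = abs(total - 2 * run)       # |mass(left) - mass(right)|
            if best_gap is None or gap < best_gap:
                best_gap, split = gap, k
        codes = shannon_fano(probs, lo, split, prefix + "0")
        codes.update(shannon_fano(probs, split, hi, prefix + "1"))
        return codes

    m = 8
    h_m = sum(1.0 / n for n in range(1, m + 1))
    p = [1.0 / (n * h_m) for n in range(1, m + 1)]
    print(shannon_fano(p))      # rank 1 gets a short codeword, rank m a long one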
Is there a general way to do this for arbitrary m, where m is the number of thingy's? I.e., is there a general scheme for splitting the infinite set {1/k} into two pieces that gives an optimal encoding for every m? If such a general scheme isn't exactly optimal, what about a scheme that is within epsilon of being optimal?
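One empirical handle on that question, as a sketch: if the first split puts ranks 1..k against ranks k+1..m, we want H_k ~ H_m/2, and since H_k ~ ln k + gamma (with gamma the Euler-Mascheroni constant), the split point should grow roughly like exp(-gamma/2)*sqrt(m), about 0.75*sqrt(m). So there is no single m-independent split, but its asymptotic shape is simple. The code below just tabulates k(m) against that prediction.

    from math import exp, sqrt

    GAMMA = 0.5772156649015329      # Euler-Mascheroni constant

    def split_point(m):
        """Smallest k with H_k >= H_m / 2: the top group {1..k} first
        reaches half of the total Zipf mass."""
        h_m = sum(1.0 / n for n in range(1, m + 1))
        h_k, k = 0.0, 0
        while h_k < h_m / 2:
            k += 1
            h_k += 1.0 / k
        return k

    for m in (10, 100, 1000, 10**4, 10**5):
        print(m, split_point(m), round(exp(-GAMMA / 2) * sqrt(m), 1))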
Don't you also want to optimise the assignments within each half, and so on recursively down to the leaves of the (search) tree; in which case an appropriate weight function is required to combine the discrepancies at each node? Dunno whether maybe Huffman coding might be at all relevant to this problem --- but you've presumably considered that possibility. Oh, and shouldn't it be "thingies"? Fred Lunnon
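For reference on Lunnon's Huffman suggestion, a minimal sketch over the same Zipf weights (names illustrative): Huffman coding produces the optimal prefix code for any known finite distribution, so its average length is a floor against which a Shannon-Fano-style split can be judged.

    import heapq
    from math import log2

    def huffman_lengths(probs):
        """Optimal prefix-code lengths for probs, via Huffman merging."""
        heap = [(p, [i]) for i, p in enumerate(probs)]
        heapq.heapify(heap)
        depth = [0] * len(probs)
        while len(heap) > 1:
            p1, g1 = heapq.heappop(heap)
            p2, g2 = heapq.heappop(heap)
            for i in g1 + g2:           # each merge pushes both groups one level deeper
                depth[i] += 1
            heapq.heappush(heap, (p1 + p2, g1 + g2))
        return depth

    m = 8
    h_m = sum(1.0 / n for n in range(1, m + 1))
    p = [1.0 / (n * h_m) for n in range(1, m + 1)]
    lengths = huffman_lengths(p)
    avg = sum(pi * li for pi, li in zip(p, lengths))
    entropy = -sum(pi * log2(pi) for pi in p)
    print(lengths, round(avg, 4), round(entropy, 4))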