Re: [math-fun] Getting a complete set
Michael Kleber writes:
On a related note, suppose I have a bin of N distinct things, but I don't know N. I'm allowed to choose-with-replacement k times, and I can recognize things I've seen before, so that I can build a histogram of how many things I saw t times, for t=1,2,3,... How do I best guess N?
This appears to be a variant of the coupon collector's problem, as discussed in http://www.math.uci.edu/~mfinkels/COUPON.PDF . They give a maximum likelihood estimate of N as the smallest integer j >= C_k satisfying

((j+1)/(j+1-C_k)) * (j/(j+1))^k < 1

where C_k is the number of distinct things you saw in your k samples.

-Thomas C
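A minimal numerical sketch of that estimate, assuming only the inequality above (the function name and the search cap are placeholders, not anything from the paper): scan upward from C_k for the first j satisfying the inequality.

    def mle_population_size(k, c_k, j_max=10**6):
        """Smallest integer j >= c_k with ((j+1)/(j+1-c_k)) * (j/(j+1))**k < 1.

        k    : number of draws with replacement
        c_k  : number of distinct items observed
        Returns None if no such j exists below j_max, which is what happens
        when every draw was distinct (c_k == k) and the estimate is unbounded.
        """
        for j in range(c_k, j_max + 1):
            if (j + 1) / (j + 1 - c_k) * (j / (j + 1)) ** k < 1:
                return j
        return None

    # Example: 20 draws, 12 distinct items seen.
    print(mle_population_size(20, 12))   # -> 17

Note the degenerate case: if all k draws come out distinct, the inequality is never satisfied and the maximum likelihood estimate does not exist, since the likelihood of seeing all-distinct draws keeps increasing with j.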
Thomas Colthurst wrote:
This appears to be a variant of the coupon collector's problem, as discussed in http://www.math.uci.edu/~mfinkels/COUPON.PDF . They give a maximum likelihood estimate of N as the smallest integer j >= C_k satisfying
((j+1)/(j+1-C_k)) * (j/(j+1))^k < 1
where C_k is the number of distinct things you saw in your k samples.
Thanks for the pointer; I'll read through the paper. Both in their maximum likelihood answer above, and in the paper in general, they seem to use only the total number of distinct things seen, and not the distribution of multiplicities with which you saw them. I wonder whether allowing the use of that extra data helps at all. --Michael Kleber
--- Michael Kleber <kleber@brandeis.edu> wrote:
Thomas Colthurst wrote:
This appears to be a variant of the coupon collector's problem, as discussed in http://www.math.uci.edu/~mfinkels/COUPON.PDF . They give a maximum likelihood estimate of N as the smallest integer j >= C_k satisfying
((j+1)/(j+1-C_k)) * (j/(j+1))^k < 1
where C_k is the number of distinct things you saw in your k samples.
Thanks for the pointer; I'll read through the paper. Both in their maximum likelihood answer above, and in the paper in general, they seem to use only the total number of distinct things seen, and not the distribution of multiplicities with which you saw them. I wonder whether allowing the use of that extra data helps at all.
--Michael Kleber
You certainly do want to use all the available information. What you want is the probability P(n1, n2, ... | N) that you get n1 objects once, n2 objects twice, etc., where 1 n1 + 2 n2 + ... = n, n is the number of draws, and N is the unknown actual number of objects. Viewed as a function of N, this is the relative likelihood that there are N objects, based upon the observation n1, n2, ... . By Bayes theorem, multiplying it by the prior probability P0(N) gives the (relative) posterior probability for N.

I don't have time to work out the general case, so I'll just do an example with n = 3 draws.

P(0,0,1|N) = 1/N^2 (one object thrice).
P(3,0,0|N) = (N-1)(N-2)/N^2 (3 objects once each).
P(1,1,0|N) = 3(N-1)/N^2 (1 object once, another twice).

If you get 3 different objects, you know that N can't be 1 or 2, and that N=3 is only 2/9 as likely as any one large value of N. But you can't discern between different large values of N, and for these you have no further information beyond that contained in your prior. If you get the same object three times, you get a convergent posterior, even if you pick the improper uniform prior.

Common sense predicts that when each object is observed multiple times, the posterior will sharply select a unique N. On the other hand, a distribution with a long tail of objects appearing just once will provide little information at large N.

An observation (n1, n2, ...) that is atypical of any N may lead one, in a real world situation, to entertain additional hypotheses, for example, that we have been disinformed.

Problems like this one, Bayes theorem, scientific inference, probability and common sense, probability as an extension of deductive logic, these are the subject matter of Edwin T. Jaynes' book "Probability, the Logic of Science".

Gene
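To make that concrete, here is a small sketch of the computation described above (an illustration only; the geometric prior at the end is a placeholder assumption, not something suggested in the message): compute P(n1, n2, ... | N) from the standard multinomial count, multiply by a prior P0(N), and normalize over a range of N.

    from math import factorial

    def histogram_likelihood(hist, N):
        """P(n1, n2, ... | N) for hist = [n1, n2, ...], where hist[t-1] is the
        number of objects drawn exactly t times, out of n = sum t*n_t draws
        with replacement from N equally likely objects."""
        d = sum(hist)                                      # distinct objects seen
        n = sum((t + 1) * c for t, c in enumerate(hist))   # total draws
        if N < d:
            return 0.0
        ways = factorial(n)            # arrangements of the n draw slots
        for i in range(d):
            ways *= N - i              # which objects fill the d distinct roles
        denom = 1
        for t, c in enumerate(hist):
            denom *= factorial(t + 1) ** c * factorial(c)
        return ways / denom / N ** n

    # Check against the n = 3 cases above:
    # histogram_likelihood([0, 0, 1], N) == 1/N^2
    # histogram_likelihood([3], N)       == (N-1)(N-2)/N^2
    # histogram_likelihood([1, 1], N)    == 3(N-1)/N^2

    def posterior(hist, prior, N_max):
        """Relative posterior P(N | n1, n2, ...) over N = 1..N_max, normalized."""
        w = {N: histogram_likelihood(hist, N) * prior(N) for N in range(1, N_max + 1)}
        total = sum(w.values())
        return {N: x / total for N, x in w.items()}

    # Placeholder prior: geometric, P0(N) proportional to 0.99**N.
    post = posterior([1, 1], prior=lambda N: 0.99 ** N, N_max=2000)
    print(max(post, key=post.get))     # most probable N under this prior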
Eugene Salamin wrote:
Common sense predicts that when each object is observed multiple times, the posterior will sharply select a unique N. On the other hand, a distribution with a long tail of objects appearing just once will provide little information at large N.
Unfortunately, I expect that my real-world application will turn out to be the latter case; that my number of samples will be small enough that I'll be lucky to ever see anything three times. (In which case, I suppose, knowing the total number of distinct elements seen is all the information there is!) On Thomas's suggestion (off-list), I had already started trying to work out the Bayesian approach, but to say that my skills here are rusty would be misrepresenting the presence of some mettle (heh) in the first place. Maybe tomorrow I'll try to work out the case where each thing is seen at most three times.
An observation (n1, n2, ...) that is atypical of any N may lead one, in a real world situation, to entertain additional hypotheses, for example, that we have been disinformed.
Yes, that's quite likely for me too. (In particular, my ability to recognize when two draws are actually the same or different is almost certainly imperfect.)
Problems like this one, Bayes theorem, scientific inference, probability and common sense, probability as an extension of deductive logic, these are the subject matter of Edwin T. Jaynes' book "Probability, the Logic of Science".
On that note, let me announce my change in employment. Having been a math professor at MIT and Brandeis for six years, I've just started doing something different. As of last month, I'm working at the Broad (rhymes with "road") Institute, an umbrella organization for research on genomics in medicine, affiliated with MIT, Harvard, and the Whitehead Institute (which the whole structure was a part of until last November).

This is Eric Lander's institute, the birthplace of the Human Genome Project and probably the world's leading center for genome sequencing. You may have seen the news story two or three weeks ago that the dog genome had just been released; that was us.

I'm in the Whole Genome Assembly group. We do "shotgun assembly": the people in the laboratory take the DNA from an organism and chop it up into lots of little pieces, and read the sequences of letters near each end of each little piece; we take all the pieces and put them back together.

Immediate change: the old mathematician's dilemma of how to deal with the question "But what real use is your work?" is entirely gone. "Oh, we're going to cure cancer" is sort of an ace in the hole...

--Michael Kleber kleber@broad.mit.edu
--- Michael Kleber <kleber@brandeis.edu> wrote:
Eugene Salamin wrote:
Common sense predicts that when each object is observed multiple times, the posterior will sharply select a unique N. On the other hand, a distribution with a long tail of objects appearing just once will provide little information at large N.
My second sentence there was a bit confused, so let me clarify. If N is very large, then all n selections will be different. So it is only in this case that the likelihood is a flat function of N for large N. Even a single duplication must make the likelihood fall off with increasing N, but as my example showed, the fall-off need not be sufficient to ensure that the posterior is a proper probability (i.e., the sum over N could diverge).
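As a quick numerical check of that divergence (using the n = 3 numbers from the earlier message, nothing new): with a single duplication the likelihood is 3(N-1)/N^2, which falls off only like 3/N, so under a flat improper prior the partial sums of the unnormalized posterior grow roughly like 3 log N and never converge.

    # Partial sums of the unnormalized posterior 3(N-1)/N^2 under a flat prior.
    for N_max in (10**2, 10**4, 10**6):
        s = sum(3 * (N - 1) / N**2 for N in range(2, N_max + 1))
        print(N_max, round(s, 2))   # keeps growing as N_max increases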
Unfortunately, I expect that my real-world application will turn out to be the latter case; that my number of samples will be small enough that I'll be lucky to ever see anything three times. (In which case, I suppose, knowing the total number of distinct elements seen is all the information there is!)
Let N be the unknown actual number of objects. Let n be the number of draws, with replacement. Let d be the number of distinct objects found. Let P(d|N,n) be the probability of getting d when N and n are known (so that the sum of P(d|N,n) over all d equals 1). Then P(d|N,n) equals n! times binomial(N,d) times the coefficient of t^d in [exp(t/N)-1]^d. By Bayes theorem, this equals the likelihood of N given the observed n and d.
On Thomas's suggestion (off-list), I had already started trying to work out the Bayesian approach, but to say that my skills here are rusty would be misrepresenting the presence of some mettle (heh) in the first place. Maybe tomorrow I'll try to work out the case where each thing is seen at most three times.
Start with Jaynes. Publisher's web page ( http://us.cambridge.org/titles/catalogue.asp?isbn=0521592712 ). It's a fun book to read, the math is not too difficult, it's full of history and amusing paradoxes. It's also 758 pages. Also see the official E. T. Jaynes web site at ( http://bayes.wustl.edu/ ).
An observation (n1, n2, ...) that is atypical of any N may lead one, in a real world situation, to entertain additional hypotheses, for example, that we have been disinformed.
Yes, that's quite likely for me too. (In particular, my ability to recognize when two draws are actually the same or different is almost certainly imperfect.)
Problems like this one, Bayes theorem, scientific inference, probability and common sense, probability as an extension of deductive logic, these are the subject matter of Edwin T. Jaynes' book "Probability, the Logic of Science".
On that note, let me announce my change in employment. Having been a math professor at MIT and Brandeis for six years, I've just started doing something different. As of last month, I'm working at the Broad (rhymes with "road") Institute, an umbrella organization for research on genomics in medicine, affiliated with MIT, Harvard, and the Whitehead Institute (which the whole structure was a part of until last November).
This is Eric Lander's institute, the birthplace of the Human Genome Project and probably the world's leading center for genome sequencing. You may have seen the news story two or three weeks ago that the dog genome had just been released; that was us.
I'm in the Whole Genome Assembly group. We do "shotgun assembly": the people in the laboratory take the DNA from an organism and chop it up into lots of little pieces, and read the sequences of letters near each end of each little piece; we take all the pieces and put them back together.
Immediate change: the old mathematician's dilemma of how to deal with the question "But what real use is your work?" is entirely gone. "Oh, we're going to cure cancer" is sort of an ace in the hole...
--Michael Kleber kleber@broad.mit.edu
Since you will be doing real world scientific inference, you certainly will need a working knowledge of Bayesian methodology. So again, I recommend Jaynes.

Gene
Correction. Let N be the unknown actual number of objects. Let n be the number of draws, with replacement. Let d be the number of distinct objects found. Let P(d|N,n) be the probability of getting d when N and n are known (so that the sum of P(d|N,n) over all d equals 1). Then P(d|N,n) equals n! times binomial(N,d) times the coefficient of t^n in [exp(t/N)-1]^d. By Bayes theorem, this equals the likelihood of N given the observed n and d. I originally said t^d instead of t^n.

Gene
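A short sketch of that corrected formula (an illustration, assuming only the statement above): the coefficient of t^n in [exp(t/N)-1]^d is d! S(n,d) / (n! N^n), with S(n,d) a Stirling number of the second kind, so P(d|N,n) = binomial(N,d) * d! * S(n,d) / N^n. The code checks that these probabilities sum to 1 over d and shows the likelihood as a function of N.

    from math import comb, factorial

    def stirling2(n, d):
        """Stirling number of the second kind S(n, d), by inclusion-exclusion."""
        return sum((-1) ** j * comb(d, j) * (d - j) ** n for j in range(d + 1)) // factorial(d)

    def p_distinct(d, N, n):
        """P(d | N, n): probability that n draws with replacement from N objects
        produce exactly d distinct ones.  Equals n! * binomial(N, d) times the
        coefficient of t^n in [exp(t/N) - 1]^d."""
        return comb(N, d) * factorial(d) * stirling2(n, d) / N ** n

    # Sanity check: the P(d | N, n) sum to 1 over d.
    N, n = 10, 6
    print(sum(p_distinct(d, N, n) for d in range(1, min(N, n) + 1)))   # ~1.0

    # Viewed as a function of N for fixed observed (n, d), this is the likelihood of N.
    print([round(p_distinct(4, m, 6), 4) for m in range(4, 12)])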
participants (3)
- Eugene Salamin
- Michael Kleber
- Thomas Colthurst