[math-fun] The importance of being a coupon-collector
Mama don't allow no p*l*t*c*l probability here: so I've gone back to collecting --- computationally rather expensive --- coupons. [ A man should have an occupation of some sort. ]

I don't know in advance how many distinct coupons are available, but have collected m among which there are just k distinct.
What is the probability that n distinct coupons are available?
What is the most likely value of n ?
What are asymptotic expressions for large m ?

Fred Lunnon
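For readers who want to play along, here is a minimal simulation sketch (my own aside, not part of Fred's post) of the process being described: draw m coupons uniformly with replacement from a pool of n equally likely types and count how many distinct types k turn up.

    import random

    def collect(n, m):
        """Draw m coupons uniformly with replacement from n equally likely types;
        return the number k of distinct types observed."""
        return len({random.randrange(n) for _ in range(m)})

    # e.g. collect(10, 25) typically comes back as 9 or 10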
I have wrestled with this problem before, and I *think* that the problem is not well-formed as stated. There needs to be a prior distribution, a given probability that n takes on different values.

I started collecting Nantucket Nectars Iced Tea caps at some point, and when I got my first duplication I spent some hours trying to figure out what I now knew about the total number of unique caps. As I recall, it turned out that the answer was "nothing except that it's at least 17".
I don't know in advance how many distinct coupons are available, but have collected m among which there are just k distinct. What is the most likely value of n ?
i give a similar problem in my intro prob class. i would approach this with a maximal likelihood approach: assuming each coupon is equally likely to appear, for a given value of n calculate the combinatorial probability of getting what you have collected, then choose the value of n that maximizes this probability.

but i don't know how to approach the asymptotics.

erich
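To make that recipe concrete, here is a small Python sketch (mine, not anything posted in the thread): the probability of seeing exactly k distinct types in m draws from n equally likely coupons, computed by inclusion-exclusion, together with a brute-force search for the n that maximizes it.

    from math import comb

    def prob_k_distinct(n, m, k):
        """P(exactly k distinct types appear in m draws), drawing uniformly
        with replacement from n equally likely coupon types."""
        if k > n or k > m:
            return 0.0
        # inclusion-exclusion count of the draw sequences that hit all of a
        # chosen set of k types, times the number of ways to choose that set
        surj = sum((-1) ** j * comb(k, j) * (k - j) ** m for j in range(k + 1))
        return comb(n, k) * surj / n ** m

    def ml_estimate(m, k, n_max=1000):
        """The n in [k, n_max] with the largest likelihood."""
        return max(range(k, n_max + 1), key=lambda n: prob_k_distinct(n, m, k))

When the sample contains no repeat (k = m) the likelihood increases with n and ml_estimate simply returns n_max, which is exactly the pathology discussed next.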
This cannot be right -- something is not getting defined properly. Let's look at Fred's first question: having drawn m coupons with k distinct, what is the probability that there are n available values? Suppose m=k=1! If we follow Erich Friedman's suggestion, we get a probability of 1 for all values of n!

A little less flippantly: suppose m=k=2. We have drawn two distinct coupons; in Friedman's method, the probability of this outcome in a 1-coupon universe is 0; in a 2-coupon universe it's 1/2; in a 3-coupon universe it's 2/3. In a universe with n coupons, the probability that the first two draws are distinct is (n-1)/n -- so Friedman's method prefers larger values of n, without limit. If I draw 2 distinct coupons, Friedman says it is more likely that there are two million coupons than that there are six.

Something is wrong here; I apologize that I have not been able to point out the problem more coherently.
i should have mentioned that this maximal likelihood approach only works if you get at least two coupons of some type -- that is, at least one repeat. in that case it gives reasonable answers: for example, try k=2 distinct coupons out of m=3 coupons collected. if you don't like the idea of 2 million coupons being "likely", then you have to set some a priori distribution favoring some smaller number of coupons, and abandon this approach.

erich
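A quick check of that example (my arithmetic, not Erich's): with n types available, the chance of seeing exactly 2 distinct coupons in 3 draws is C(n,2) * 6 / n^3 = 3(n-1)/n^2, which is 3/4 at n=2, 2/3 at n=3, 9/16 at n=4, and keeps decreasing -- so maximum likelihood sensibly answers n = 2.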
I'm just going to wing it here, but I think this is a problem that was important in W.W. I, where occasionally a German tank would be captured and its serial number noted. Eventually this information was used to estimate how many tanks they had altogether.

(Here I use a well-known mathematician's trick dating back millennia: If you don't know the answer to a problem, solve a different one.)

Suppose we pick n random points {x_j} = x_1,...,x_n from an interval [0,1] in R. Also, rename the points according to their order:

   x_(1) < x_(2) < ... < x_(n) .

The probability that the maximum point x_(n) on [0,1] is <= t (t in [0,1]) is given by t^n, the probability that all n points lie at or to the left of t. So the density of the maximum x_(n) is

   d_max(t) = n t^(n-1) ,

and its expected value is

   E(x_(n)) = Integral_{0<=t<=1} n t^n dt = n/(n+1) .

Symmetrically, the expected value of x_(1) = min{x_j} is

   E(x_(1)) = 1/(n+1) .

The probability that x_(1) >= t is (1-t)^n (all x_j are at least t), so Prob(x_(1) <= t) = 1 - (1-t)^n and its density is

(*)   d_min(t) = n (1-t)^(n-1) .

The n+1 points consisting of the n random points plus the point 0 := (0 ~ 1) may be thought of as uniformly distributed on the resulting circle C = R/Z, so the setup is the same as n+1 points uniformly distributed on C. Therefore by symmetry we can conclude that the length of each interval between successive points x_(1) < x_(2) < ... < x_(n) has the same density (*) as does x_(1) = x_(1) - 0.

Now suppose we don't know the length L of the original interval, which we assume to be [0,L].

((( We want the joint distribution of x_(1) and x_(n) to infer the maximum likelihood value of ML(n) = L/(x_(n) - x_(1)). This can be used to infer L as L = approx. ML(n) * (x_(n) - x_(1)). )))

BUT: For now, back to the unit interval [0,1]: Wikipedia states that the joint density of u = x_(1) and v = x_(n) is

   f(u,v) = (n! / (n-2)!) (v-u)^(n-2) .

At this point I have to go; maybe more later.

—Dan
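As a quick sanity check on those expectations, here is a throwaway Monte Carlo sketch (my own, not part of Dan's message):

    import random

    def order_stat_check(n, trials=200_000):
        """Monte Carlo estimates of E[min] and E[max] for n uniform points on [0,1];
        compare with 1/(n+1) and n/(n+1) respectively."""
        tot_min = tot_max = 0.0
        for _ in range(trials):
            xs = [random.random() for _ in range(n)]
            tot_min += min(xs)
            tot_max += max(xs)
        return tot_min / trials, tot_max / trials

    # e.g. order_stat_check(9) should come out near (0.1, 0.9)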
I had heard Dan's serial-number anecdote as well, though my informant claimed it related to rockets. [ I did often wonder how one retrieved such data from the exploded fragments, but supposed the missiles might often fail to detonate ... ] And I had that in mind when posing this problem. But here we are sampling with replacement; and no distance measure seems available for coupons!

Is the problem well posed, doubts Allan. Well, the expected number of trials to acquire all n coupons asymptotically equals n log n --- a result which, while still a student, I succeeded in deriving at the cost of several pages of convoluted algebra, which I later proudly showed to an older colleague. Who proceeded to derive the same result in a single line. A similar fate has overtaken most of my forays into probability.

Ahem, I digress. Now I argue that if a trial reaches k distinct coupons in m > c k log k trials, where (say) c = 2, or 10, then it's a damned good bet that n = k . Anybody disagree?

Fred Lunnon
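The n log n figure is also easy to check empirically. Here is a quick simulation sketch (mine, not Fred's) of the classical full-collection time; the exact mean is n times the n-th harmonic number, of which n log n is the leading term.

    import random

    def trials_to_collect_all(n):
        """Draw coupons uniformly at random until all n distinct types have
        appeared; return the number of draws that took."""
        seen, draws = set(), 0
        while len(seen) < n:
            seen.add(random.randrange(n))
            draws += 1
        return draws

    def average_trials(n, reps=2000):
        return sum(trials_to_collect_all(n) for _ in range(reps)) / reps

    # e.g. average_trials(50) lands near 50 * H_50 ~ 225, versus the leading
    # term 50 * log(50) ~ 196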
https://en.wikipedia.org/wiki/German_tank_problem
-- Mike Stay - metaweta@gmail.com http://www.cs.auckland.ac.nz/~mike http://reperiendi.wordpress.com
Aha, so it was W.W. II, not W.W. I. Hey, I was only off by 1. —Dan
On 03/02/2016 23:46, Fred Lunnon wrote:
Ahem, I digress. Now I argue that if a trial reaches k distinct coupons in m > c k log k trials, where (say) c = 2, or 10, then it's a damned good bet that n = k . Anybody disagree?
With a strong enough prior favouring a larger n, you will prefer the hypothesis that you just saw very improbable results.

(But you probably should never actually *have* a strong enough prior for that; if m and c are large enough, someone who thought they had such a prior would be saying "oops, I chose the wrong model", which means that in some sense their prior didn't favour large n so strongly after all.)

-- g
I have no prior favouring any n ; all I know is that n is finite (rather than denumerably infinite).

WFL
Well, there's no uniform distribution on a countably infinite set, so you have to choose some other one. May I suggest the universal prior from Kolmogorov complexity?
-- Mike Stay - metaweta@gmail.com http://www.cs.auckland.ac.nz/~mike http://reperiendi.wordpress.com
Without some explicit expressions to discuss, I'm grasping at air here. But I find it difficult to credit that assuming a uniform distribution of n in (say) [1..100] would result in an estimated n substantially different from that given by [1..10^10] , when --- in my present experiment --- I had k = 4 coupons at m = 7 trials, and k remains unchanged at m = 17 .

The fact that classical probability cannot cope with a uniform denumerable distribution does not preclude the existence of a perfectly definite limiting answer as the upper bound -> infinity.

Fred Lunnon
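Fred's point here is easy to test numerically. Below is a rough Bayesian sketch (my own, and only as good as the equal-likelihood coupon model): put a uniform prior on n in {1, ..., n_max}, weight each n by the likelihood of seeing k distinct coupons in m draws, and normalise. Because that likelihood falls off like n^(k-m) for large n, the result is essentially independent of n_max once n_max is comfortably above k.

    from math import comb

    def prob_k_distinct(n, m, k):
        """P(exactly k distinct types appear in m uniform draws from n types)."""
        if k > n:
            return 0.0
        surj = sum((-1) ** j * comb(k, j) * (k - j) ** m for j in range(k + 1))
        return comb(n, k) * surj / n ** m

    def posterior(m, k, n_max):
        """Posterior over n = 1..n_max under a uniform prior on {1, ..., n_max}."""
        likes = [prob_k_distinct(n, m, k) for n in range(1, n_max + 1)]
        total = sum(likes)
        return [p / total for p in likes]

    # With Fred's data (k = 4 distinct after m = 17 draws), posterior(17, 4, 100)[3]
    # and posterior(17, 4, 10000)[3] -- the posterior probability of n = 4 --
    # agree to many decimal places.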
On 04/02/2016 02:44, Fred Lunnon wrote:
Without some explicit expressions to discuss, I'm grasping at air here. But I find it difficult to credit that assuming a uniform distribution of n in (say) [1..100] would result in an estimated n substantially different from that given by [1..10^10] , when --- in my present experiment --- I had k = 4 coupons at m = 7 trials, and k remains unchanged at m = 17 .
I think this (suitably generalized) is probably correct. (I haven't actually done any of the relevant calculations.) It's equivalent to max likelihood.

But Allan's original claim wasn't that max likelihood doesn't give a definite answer, it was that since you may have prior opinions about how likely any given n is, you need to take them into account when choosing an n in light of the evidence. And for that there's no particular reason why you should use max-likelihood -- and the fact that for k=m max-likelihood says bigger n are always better seems sufficient to indicate that it isn't a good general answer.

-- g
Allan complained that maximum likelihood breaks down when k = m (every coupon in the sample occurring once). However this behaviour is perfectly appropriate, since such a sample is obviously too small to yield any further information, other than " m is too small" !

Now I (think that I) at last have a usable expression for the probability that just n-l (ell) of n coupons occur in m trials:

   p(n, l, m) = SUM_{l <= i <= n} (-1)^(l-i) (1 - i/n)^m n! / ( (n-i)! (i-l)! l! )

--- it would be much appreciated if some kind soul could validate this!

Feeding in my somewhat parsimonious data yields approximately

   p(4, 0, 18) = 0.98 ,  p(5, 1, 18) = 0.088 ,  p(6, 2, 18) = 0.0099 ,  p(7, 3, 18) = 0.0014 , ...

The probabilities drop off exponentially at first, though eventually their successive ratio approaches unity. If I have correctly understood the maximum likelihood principle, this constitutes statistically clear evidence that n = 4 in this case; though at this stage I'm not entirely sure how best to attach a numerical indicator of significance to these numbers.

Fred Lunnon
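Since Fred asks for validation, here is one way to check the expression numerically (my own sketch, not anything from the thread): implement the sum as written and compare it against exhaustive enumeration of all n^m equally likely draw sequences for small parameters. The sum is just inclusion-exclusion over which coupons are missed, so the two agree; and at the thread's data it reproduces the figures above (about 0.977 for p(4,0,18) and 0.088 for p(5,1,18)).

    from math import factorial
    from itertools import product

    def p_formula(n, l, m):
        """Fred's closed form: P(exactly n - l of the n coupon types appear
        in m draws).  Note (-1)^(i-l) = (-1)^(l-i)."""
        total = 0.0
        for i in range(l, n + 1):
            coeff = factorial(n) / (factorial(n - i) * factorial(i - l) * factorial(l))
            total += (-1) ** (i - l) * (1 - i / n) ** m * coeff
        return total

    def p_bruteforce(n, l, m):
        """Same probability by enumerating every length-m draw sequence
        (feasible only for small n and m)."""
        hits = sum(1 for seq in product(range(n), repeat=m) if len(set(seq)) == n - l)
        return hits / n ** m

    # spot checks, e.g. p_formula(4, 1, 5) vs p_bruteforce(4, 1, 5),
    #                   p_formula(5, 2, 6) vs p_bruteforce(5, 2, 6)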
participants (7)

- Allan Wechsler
- Dan Asimov
- Dan Asimov
- Erich Friedman
- Fred Lunnon
- Gareth McCaughan
- Mike Stay