[math-fun] The importance of being a coupon-collector
Mama don't allow no p*l*t*c*l probability here: so I've gone back to collecting --- computationally rather expensive --- coupons. [ A man should have an occupation of some sort. ]

I don't know in advance how many distinct coupons are available, but have collected m among which there are just k distinct.
What is the probability that n distinct coupons are available?
What is the most likely value of n ?
What are asymptotic expressions for large m ?

Fred Lunnon
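For readers who want to play along, here is a minimal simulation sketch (my own aside, not part of Fred's post) of the process being described: draw m coupons uniformly with replacement from a pool of n equally likely types and count how many distinct types k turn up.

    import random

    def collect(n, m):
        """Draw m coupons uniformly with replacement from n equally likely types;
        return the number k of distinct types observed."""
        return len({random.randrange(n) for _ in range(m)})

    # e.g. collect(10, 25) typically comes back as 9 or 10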
I have wrestled with this problem before, and I *think* that the problem is not well-formed as stated. There needs to be a prior distribution, a given probability that n takes on different values.

I started collecting Nantucket Nectars Iced Tea caps at some point, and when I got my first duplication I spent some hours trying to figure out what I now knew about the total number of unique caps. As I recall, it turned out that the answer was "nothing except that it's at least 17".
I don't know in advance how many distinct coupons are available, but have collected m among which there are just k distinct. What is the most likely value of n ?
i give a similar problem in my intro prob class. i would approach this with a maximal likelihood approach: assuming each coupon is equally likely to appear, for a given value of n calculate the combinatorial probability of getting what you have collected, then choose the value of n that maximizes this probability.

but i don't know how to approach the asymptotics.

erich
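To make that recipe concrete, here is a small Python sketch (mine, not anything posted in the thread): the probability of seeing exactly k distinct types in m draws from n equally likely coupons, computed by inclusion-exclusion, together with a brute-force search for the n that maximizes it.

    from math import comb

    def prob_k_distinct(n, m, k):
        """P(exactly k distinct types appear in m draws), drawing uniformly
        with replacement from n equally likely coupon types."""
        if k > n or k > m:
            return 0.0
        # inclusion-exclusion count of the draw sequences that hit all of a
        # chosen set of k types, times the number of ways to choose that set
        surj = sum((-1) ** j * comb(k, j) * (k - j) ** m for j in range(k + 1))
        return comb(n, k) * surj / n ** m

    def ml_estimate(m, k, n_max=1000):
        """The n in [k, n_max] with the largest likelihood."""
        return max(range(k, n_max + 1), key=lambda n: prob_k_distinct(n, m, k))

When the sample contains no repeat (k = m) the likelihood increases with n and ml_estimate simply returns n_max, which is exactly the pathology discussed next.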
This cannot be right -- something is not getting defined properly. Let's look at Fred's first question: having drawn m coupons with k distinct, what is the probability that there are n available values? Suppose m=k=1! If we follow Erich Friedman's suggestion, we get a probability of 1 for all values of n!

A little less flippantly: suppose m=k=2. We have drawn two distinct coupons; in Friedman's method, the probability of this outcome in a 1-coupon universe is 0; in a 2-coupon universe it's 1/2; in a 3-coupon universe it's 2/3. In a universe with n coupons, the probability that the first two draws are distinct is (n-1)/n -- so Friedman's method prefers larger values of n, without limit. If I draw 2 distinct coupons, Friedman says it is more likely that there are two million coupons than that there are six.

Something is wrong here; I apologize that I have not been able to point out the problem more coherently.
i should have mentioned that this maximal likelihood approach only works if you get at least two coupons of some type -- that is, at least one repeat. in that case it gives reasonable answers: for example, try k=2 distinct coupons out of m=3 coupons collected. if you don't like the idea of 2 million coupons being "likely", then you have to set some a priori distribution favoring some smaller number of coupons, and abandon this approach.

erich
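A quick check of that example (my arithmetic, not Erich's): with n types available, the chance of seeing exactly 2 distinct coupons in 3 draws is C(n,2) * 6 / n^3 = 3(n-1)/n^2, which is 3/4 at n=2, 2/3 at n=3, 9/16 at n=4, and keeps decreasing -- so maximum likelihood sensibly answers n = 2.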
I'm just going to wing it here, but I think this is a problem that was important in W.W. I, where occasionally a German tank would be captured and its serial number noted. Eventually this information was used to estimate how many tanks they had altogether.

(Here I use a well-known mathematician's trick dating back millennia: If you don't know the answer to a problem, solve a different one.)

Suppose we pick n random points {x_j} = x_1,...,x_n from an interval [0,1] in R. Also, rename the points according to their order:

   x_(1) < x_(2) < ... < x_(n) .

The probability that the maximum point x_(n) on [0,1] is <= t (t in [0,1]) is given by t^n, the probability that all n points lie at or to the left of t. So the density of the maximum x_(n) is

   d_max(t) = n t^(n-1) ,

and its expected value is

   E(x_(n)) = Integral_{0<=t<=1} n t^n dt = n/(n+1) .

Symmetrically, the expected value of x_(1) = min{x_j} is

   E(x_(1)) = 1/(n+1) .

The probability that x_(1) >= t is (1-t)^n (all x_j are at least t), so Prob(x_(1) <= t) = 1 - (1-t)^n and its density is

(*)   d_min(t) = n (1-t)^(n-1) .

The n+1 points consisting of the n random points plus the point 0 := (0 ~ 1) may be thought of as uniformly distributed on the resulting circle C = R/Z, so the setup is the same as n+1 points uniformly distributed on C. Therefore by symmetry we can conclude that the length of each interval between successive points x_(1) < x_(2) < ... < x_(n) has the same density (*) as does x_(1) = x_(1) - 0.

Now suppose we don't know the length L of the original interval, which we assume to be [0,L].

((( We want the joint distribution of x_(1) and x_(n) to infer the maximum likelihood value of ML(n) = L/(x_(n) - x_(1)). This can be used to infer L as L = approx. ML(n) * (x_(n) - x_(1)). )))

BUT: For now, back to the unit interval [0,1]: Wikipedia states that the joint density of u = x_(1) and v = x_(n) is

   f(u,v) = (n! / (n-2)!) (v-u)^(n-2) .

At this point I have to go; maybe more later.

—Dan
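As a quick sanity check on those expectations, here is a throwaway Monte Carlo sketch (my own, not part of Dan's message):

    import random

    def order_stat_check(n, trials=200_000):
        """Monte Carlo estimates of E[min] and E[max] for n uniform points on [0,1];
        compare with 1/(n+1) and n/(n+1) respectively."""
        tot_min = tot_max = 0.0
        for _ in range(trials):
            xs = [random.random() for _ in range(n)]
            tot_min += min(xs)
            tot_max += max(xs)
        return tot_min / trials, tot_max / trials

    # e.g. order_stat_check(9) should come out near (0.1, 0.9)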
I had heard Dan's serial-number anecdote as well, though my informant claimed it related to rockets. [ I did often wonder how one retrieved such data from the exploded fragments, but supposed the missiles might often fail to detonate ... ] And I had that in mind when posing this problem. But here we are sampling with replacement; and no distance measure seems available for coupons!

Is the problem well posed, doubts Allan. Well, the expected number of trials to acquire all n coupons asymptotically equals n log n --- a result which, while still a student, I succeeded in deriving at the cost of several pages of convoluted algebra, which I later proudly showed to an older colleague. Who proceeded to derive the same result in a single line. A similar fate has overtaken most of my forays into probability.

Ahem, I digress. Now I argue that if a trial reaches k distinct coupons in m > c k log k trials, where (say) c = 2, or 10, then it's a damned good bet that n = k . Anybody disagree?

Fred Lunnon
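The n log n figure is also easy to check empirically. Here is a quick simulation sketch (mine, not Fred's) of the classical full-collection time; the exact mean is n times the n-th harmonic number, of which n log n is the leading term.

    import random

    def trials_to_collect_all(n):
        """Draw coupons uniformly at random until all n distinct types have
        appeared; return the number of draws that took."""
        seen, draws = set(), 0
        while len(seen) < n:
            seen.add(random.randrange(n))
            draws += 1
        return draws

    def average_trials(n, reps=2000):
        return sum(trials_to_collect_all(n) for _ in range(reps)) / reps

    # e.g. average_trials(50) lands near 50 * H_50 ~ 225, versus the leading
    # term 50 * log(50) ~ 196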
https://en.wikipedia.org/wiki/German_tank_problem
-- Mike Stay - metaweta@gmail.com http://www.cs.auckland.ac.nz/~mike http://reperiendi.wordpress.com
Aha, so it was W.W. II, not W.W. I. Hey, I was only off by 1. —Dan
On 03/02/2016 23:46, Fred Lunnon wrote:
Ahem, I digress. Now I argue that if a trial reaches k distinct coupons in m > c k log k trials, where (say) c = 2, or 10, then it's a damned good bet that n = k . Anybody disagree?
With a strong enough prior favouring a larger n, you will prefer the hypothesis that you just saw very improbable results.

(But you probably should never actually *have* a strong enough prior for that; if m and c are large enough, someone who thought they had such a prior would be saying "oops, I chose the wrong model", which means that in some sense their prior didn't favour large n so strongly after all.)

-- g
I have no prior favouring any n ; all I know is that n is finite (rather than denumerably infinite).

WFL
Well, there's no uniform distribution on a countably infinite set, so you have to choose some other one. May I suggest the universal prior from Kolmogorov complexity?
-- Mike Stay - metaweta@gmail.com http://www.cs.auckland.ac.nz/~mike http://reperiendi.wordpress.com
Without some explicit expressions to discuss, I'm grasping at air here. But I find it difficult to credit that assuming a uniform distribution of n in (say) [1..100] would result in an estimated n substantially different from that given by [1..10^10] , when --- in my present experiment --- I had k = 4 coupons at m = 7 trials, and k remains unchanged at m = 17 .

The fact that classical probability cannot cope with a uniform denumerable distribution does not preclude the existence of a perfectly definite limiting answer as the upper bound -> infinity.

Fred Lunnon
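Fred's point here is easy to test numerically. Below is a rough Bayesian sketch (my own, and only as good as the equal-likelihood coupon model): put a uniform prior on n in {1, ..., n_max}, weight each n by the likelihood of seeing k distinct coupons in m draws, and normalise. Because that likelihood falls off like n^(k-m) for large n, the result is essentially independent of n_max once n_max is comfortably above k.

    from math import comb

    def prob_k_distinct(n, m, k):
        """P(exactly k distinct types appear in m uniform draws from n types)."""
        if k > n:
            return 0.0
        surj = sum((-1) ** j * comb(k, j) * (k - j) ** m for j in range(k + 1))
        return comb(n, k) * surj / n ** m

    def posterior(m, k, n_max):
        """Posterior over n = 1..n_max under a uniform prior on {1, ..., n_max}."""
        likes = [prob_k_distinct(n, m, k) for n in range(1, n_max + 1)]
        total = sum(likes)
        return [p / total for p in likes]

    # With Fred's data (k = 4 distinct after m = 17 draws), posterior(17, 4, 100)[3]
    # and posterior(17, 4, 10000)[3] -- the posterior probability of n = 4 --
    # agree to many decimal places.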
On 04/02/2016 02:44, Fred Lunnon wrote:
Without some explicit expressions to discuss, I'm grasping at air here. But I find it difficult to credit that assuming a uniform distribution of n in (say) [1..100] would result in an estimated n substantially different from that given by [1..10^10] , when --- in my present experiment --- I had k = 4 coupons at m = 7 trials, and k remains unchanged at m = 17 .
I think this (suitably generalized) is probably correct. (I haven't actually done any of the relevant calculations.) It's equivalent to max likelihood.

But Allan's original claim wasn't that max likelihood doesn't give a definite answer, it was that since you may have prior opinions about how likely any given n is, you need to take them into account when choosing an n in light of the evidence. And for that there's no particular reason why you should use max-likelihood -- and the fact that for k=m max-likelihood says bigger n are always better seems sufficient to indicate that it isn't a good general answer.

-- g
Allan complained that maximum likelihood breaks down when k = m (every coupon in the sample occurring once). However this behaviour is perfectly appropriate, since such a sample is obviously too small to yield any further information, other than " m is too small" !

Now I (think that I) at last have a usable expression for the probability that just n-l (ell) of n coupons occur in m trials:

   p(n, l, m) = SUM_{l <= i <= n} (-1)^(l-i) (1 - i/n)^m n! / ( (n-i)! (i-l)! l! )

--- it would be much appreciated if some kind soul could validate this!

Feeding in my somewhat parsimonious data yields approximately

   p(4, 0, 18) = 0.98 ,  p(5, 1, 18) = 0.088 ,  p(6, 2, 18) = 0.0099 ,  p(7, 3, 18) = 0.0014 , ...

The probabilities drop off exponentially at first, though eventually their successive ratio approaches unity. If I have correctly understood the maximum likelihood principle, this constitutes statistically clear evidence that n = 4 in this case; though at this stage I'm not entirely sure how best to attach a numerical indicator of significance to these numbers.

Fred Lunnon
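Since Fred asks for validation, here is one way to check the expression numerically (my own sketch, not anything from the thread): implement the sum as written and compare it against exhaustive enumeration of all n^m equally likely draw sequences for small parameters. The sum is just inclusion-exclusion over which coupons are missed, so the two agree; and at the thread's data it reproduces the figures above (about 0.977 for p(4,0,18) and 0.088 for p(5,1,18)).

    from math import factorial
    from itertools import product

    def p_formula(n, l, m):
        """Fred's closed form: P(exactly n - l of the n coupon types appear
        in m draws).  Note (-1)^(i-l) = (-1)^(l-i)."""
        total = 0.0
        for i in range(l, n + 1):
            coeff = factorial(n) / (factorial(n - i) * factorial(i - l) * factorial(l))
            total += (-1) ** (i - l) * (1 - i / n) ** m * coeff
        return total

    def p_bruteforce(n, l, m):
        """Same probability by enumerating every length-m draw sequence
        (feasible only for small n and m)."""
        hits = sum(1 for seq in product(range(n), repeat=m) if len(set(seq)) == n - l)
        return hits / n ** m

    # spot checks, e.g. p_formula(4, 1, 5) vs p_bruteforce(4, 1, 5),
    #                   p_formula(5, 2, 6) vs p_bruteforce(5, 2, 6)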
participants (7)

- Allan Wechsler
- Dan Asimov
- Dan Asimov
- Erich Friedman
- Fred Lunnon
- Gareth McCaughan
- Mike Stay