[math-fun] how many records? Asymptotic normality and review
I claim that the number R of record-high entries in a left-right-scan of a random-ordered list of N distinct numbers, is ASYMPTOTICALLY (for N large) NORMALLY DISTRIBUTED. In this post I will sketch a proof. To first recall the preceding progress, I found a disgusting proof (too horrible to describe) that expect(R) = harmonic(N,1) variance(R) = harmonic(N,1) - harmonic(N,2) where harmonic(n,k) = 1 + 1/2^k + 1/3^k + ... + 1/n^k. It turns out these are not new (even though NIST Stirling number fact compendium is ignorant of all this...) and were known to Alfred Renyi in the early 1960s, as described in Ned Glick: Breaking records and breaking boards, Amer. Mathematical Monthly 85 (Jan. 1978) 2-26, on page 14. Renyi also recognized the connection to Stirling numbers. http://www.isse.ucar.edu/extremevalues/nedglick.pdf Then Cris Moore found a nice proof method, which shows 1: E[R] = H(N,1) 2: E[R(R-1)] = H(N,1)^2 - H(N,2) 3: E[R(R-1)(R-2)] = H(N,1)^3 - 3 H(N,1) H(N,2) + 2 H(N,3) 4: E[R(R-1)(R-2)(R-3)] = H(N,1)^4 - 6 H(N,1)^2 H(N,2) + 3 H(N,2)^2 + 8 H(N,1) H(N,3) - 6 H(N,4). where H(n,k)=harmonic(n,k).... and in principle Moore's techniques allow one to work out any "falling factorial moment", these being just the first 4 cases. Unfortunately it gets more and more mysterious fast what the Moorish formulae are as the order increases. However, it should be pretty obvious how to generalize the above 4 statements to any order provided we just say "for some unspecified coefficients". I.e. in claim 4, the coefficients 1,-6,3,8,-6 are mysterious, but Moore's method makes it clear that suitable integer coefficients always exist at any order, and at present the best way I know to find them at orders>4 is simply to regard them as undetermined and solve for them (using exact Gaussian elimination) to exactly fit data, which in turn is computed exactly for (enough) small N by brute force. OK, now to prove the asymptotic normality claim (the normal has the mean and variance formulas above). The idea is to partition the N numbers chronologically into "eras" where each successive era increases the total cardinality so far (including both it and all previous eras) by a factor of e. The count of records within each era is INDEPENDENT of the number in every other era. Furthermore, the exact mean and variance of the count (for each era) is exactly known. The counts within the eras are almost-identically-distibuted random variables. Specifically, asymptotically for high eras, the means are asymptotic to 1 and the variances also to 1. If "almost identically" were "identically" we would now be done because of the central limit theorem for sums of i.i.d. random variables with variance=1. We need a form of the central limit theorem valid for not-quite-identically distributed random variables. Upon reading this web page http://www.johndcook.com/central_limit_theorems.html by John D.Cook surveying such central limit theorems, his formulation of "Lyapunov's CLT" (although NOT wikipedia's lamer formulation of Lyapunov's CLT) seems to do the job, in view of the bounds on 4th moments arising from Moorism. To make a precise claim, (R-ln(N))/sqrt(ln(N)) is an asymptotically (when N-->infinity) normally distributed random variable with mean=0 and variance=1. Note that this asymptotic normality claim converges a lot slower than it usually does since it depends on ln(N) being large, vs usually N being large suffices. Now once we know this, it is trivial to write down asymptotic formulae for every (mean-centered) power-moment etc (since those all are exactly known for normal curve, computing them is just Euler's gamma function integral in disguise). And this also gives us a great deal of understanding of the asymptotic behavior of the Stirling numbers of the first kind (which again NIST compendium is ignorant of). But (annoyingly) the asymptotic normality claim also is not new and is stated by Glick on page 15 and again attributed to Renyi 1962. So, not much newness here. Moore's path to understanding is nicer than Glick's, I'll say that. -- Warren D. Smith http://RangeVoting.org <-- add your endorsement (by clicking "endorse" as 1st step)
Since the events that the i'th entry is a record are independent, doesn't this follow from a central limit like theorem? The number of records is a sum of independent binomial random variables, Bin(1,1/i). On Apr 17, 2013, at 10:18 AM, Warren D Smith wrote:
I claim that the number R of record-high entries in a left-right-scan of a random-ordered list of N distinct numbers, is ASYMPTOTICALLY (for N large) NORMALLY DISTRIBUTED.
Cristopher Moore Professor, Santa Fe Institute
participants (2)
-
Cris Moore -
Warren D Smith