New subject: [math-fun] how many records? Asymptotic normality and review

17 Apr 2013

      I claim that the number R of record-high entries in a left-right-scan
of a random-ordered list of N distinct numbers, is ASYMPTOTICALLY (for
N large) NORMALLY DISTRIBUTED.

In this post I will sketch a proof.
To first recall the preceding progress, I found a disgusting proof (too horrible
to describe) that

 expect(R) = harmonic(N,1)

 variance(R) = harmonic(N,1) - harmonic(N,2)

where harmonic(n,k) = 1 + 1/2^k + 1/3^k + ... + 1/n^k.
It turns out these are not new (even though NIST Stirling number fact
compendium is ignorant of all this...) and were known to Alfred Renyi
in the early 1960s, as described in
Ned Glick: Breaking records and breaking boards, Amer. Mathematical
Monthly 85 (Jan. 1978) 2-26, on page 14. Renyi also recognized the
connection to Stirling numbers.
http://www.isse.ucar.edu/extremevalues/nedglick.pdf

Then Cris Moore found a nice proof method, which shows
1: E[R] = H(N,1)
2: E[R(R-1)] = H(N,1)^2 - H(N,2)
3: E[R(R-1)(R-2)] = H(N,1)^3 - 3 H(N,1) H(N,2) + 2 H(N,3)
4: E[R(R-1)(R-2)(R-3)] = H(N,1)^4 - 6 H(N,1)^2 H(N,2)
+  3 H(N,2)^2 + 8 H(N,1) H(N,3) - 6 H(N,4).
where H(n,k)=harmonic(n,k).... and in principle Moore's techniques
allow one to work out any "falling factorial moment", these being just
the first 4 cases.  Unfortunately it gets more and more mysterious
fast what the Moorish formulae
are as the order increases.  However, it should be pretty obvious how
to generalize
the above 4 statements to any order provided we just say "for some unspecified
coefficients".  I.e. in claim 4, the coefficients 1,-6,3,8,-6 are
mysterious, but Moore's
method makes it clear that suitable integer coefficients always exist
at any order,
and at present the best way I know to find them at orders>4 is simply
to regard them as undetermined and solve for them (using exact
Gaussian elimination) to exactly fit data, which in turn is computed
exactly for (enough) small N by brute force.

OK, now to prove the asymptotic normality claim (the normal has the
mean and variance formulas above).  The idea is to partition the N
numbers chronologically into "eras"
where each successive era increases the total cardinality so far
(including both it
and all previous eras) by a factor of e.   The count of records within each era
is INDEPENDENT of the number in every other era.  Furthermore, the exact mean
and variance of the count (for each era) is exactly known.   The counts within
the eras are almost-identically-distibuted random variables.
Specifically, asymptotically
for high eras, the means are asymptotic to 1 and the variances also to 1.

If "almost identically" were "identically" we would now be done
because of the central
limit theorem for sums of  i.i.d. random variables with variance=1.

We need a form of the central limit theorem valid for
not-quite-identically distributed random variables.  Upon reading this
web page
   http://www.johndcook.com/central_limit_theorems.html
by John D.Cook surveying such central limit theorems,
his formulation of "Lyapunov's CLT" (although NOT wikipedia's lamer
formulation of Lyapunov's CLT) seems to do the job, in view of
the bounds on 4th moments arising from Moorism.

To make a precise claim,  (R-ln(N))/sqrt(ln(N))
is an asymptotically (when N-->infinity)
normally distributed random variable with mean=0 and variance=1.

Note that this asymptotic normality claim converges a lot slower than
it usually does
since it depends on ln(N) being large, vs usually N being large suffices.

Now once we know this, it is trivial to write down asymptotic formulae
for every (mean-centered) power-moment etc  (since those all are
exactly known for normal curve,
computing them is just Euler's gamma function integral in disguise).
And this also gives us a great deal of understanding of the asymptotic
behavior of the Stirling numbers of the first kind (which again NIST
compendium is ignorant of).

But (annoyingly) the asymptotic normality claim also is not new and is
stated by Glick on page 15 and again attributed to Renyi 1962.

So, not much newness here. Moore's path to understanding is nicer than
Glick's, I'll
say that.

-- 
Warren D. Smith
http://RangeVoting.org  <-- add your endorsement (by clicking
"endorse" as 1st step)

[math-fun] how many records? Asymptotic normality and review

Warren D Smith

Cris Moore

tags

participants (2)