Re: [math-fun] Calculating the standard deviation incrementally
I haven't been following this thread that carefully, so the following point may already have been made. The computation of the variance below involves the addition/subtraction of very large values, which can result in massive cancellations. Thus, though while the formulae work, you may have to carry very large precisions. In fact, even the simple summation of the values themselves is a serious problem, as I discovered in a real application many years ago, and which I passed on to my students in the following form: for i=0 to 1 by 0.000000001 do <loop> I believe that if i is single precision IEEE float, the increment is washed out well before hitting 1.0, which means that this loop is actually an infinite loop. (In the old days, computers were so slow that loops of this sort appeared to be infinite in any case! Large scientific calculations need the precision to go up with the speed of the computer, so that naive computations like this still "work".) A very good summation is to remember all the values, sort them by absolute value, and then add them in terms of increasing value. An almost-as-good method is to keep log2(n) levels of sums, and sum as a binary tree. I seem to recall that the CACM (Comms. of the ACM) had a number of articles about the "real-time" standard deviation problem back in the late 1960's - early 1970's time frame. The index of this publication is online and freely accessible at acm.org. At 10:17 PM 7/16/03 -0600, Roland Silver wrote:
Bernie Cosell, All you need to do is compute on the fly the sum of squares S = x[1]^2 + ... + x[n]^2 of the empirical values, as well as the sum of values T = x[1] + ... + x[n]. The mean M = T/n, and the variance V = ((x[1] - M)^2 + (x[n] - M)^2)/(n-1). The SD is the square root of V. If you expand the terms in the equation for V, you get V = (S -2*M*T + n*M^2)/(n-1) = (S - n*M^2)/(n-1), which you can compute in one pass.
At 05:37 AM 7/17/2003, Henry Baker wrote:
The computation of the variance below involves the addition/subtraction of very large values, which can result in massive cancellations. Thus, though while the formulae work, you may have to carry very large precisions.
I did a test of this, and problems can occur if the standard deviation is very small. I tested with 1000 data points generated with a mean of 1 and a small standard deviation. For double precision, a SD of 10^-6 is OK, 10^-7 gives problems. For extended precision, a SD of 10^-8 is OK, 10^-9 gives problems. So the std. dev. of the data set has to be pretty small to cause problems.
When the mean is much larger than the standard deviation, there is a risk that the subtraction <x^2> - <x>^2 will introduce cancellation errors. The comutationally simplest method for eliminating this error is to offset the data by some estimate of the mean, say by the first datum x[1], when accumulating the sums of x[i] and x[i]^2, and then adding the offset back to the calculated mean at the end. __________________________________ Do you Yahoo!? SBC Yahoo! DSL - Now only $29.95 per month! http://sbc.yahoo.com
participants (3)
-
Eugene Salamin -
Henry Baker -
Jud McCranie