[math-fun] Calculating the standard deviation incrementally
I thought *SURE* I knew an algorithm once, long ago, for calculating the SD "on the fly", but I can't remember any details of any such trickery and a quick Google search only came up with algorithms that I already know and that start with the obvious "calculate the mean..." Basically, I have a largish data set that it will only be convenient to read once and I'd like to be able to calculate [or at least estimate] the SD as the data comes by and avoid having to scroll it into a temp file and then do a second pass over the data from the temp file... Is there such an algorithm or am I just misremembering? Thanks! /Bernie\ -- Bernie Cosell Fantasy Farm Fibers mailto:bernie@fantasyfarm.com Pearisburg, VA --> Too many people, too few sheep <--
Thu, 10 Jul 2003 11:20:52 -0400 "Bernie Cosell" <bernie@fantasyfarm.com> I thought *SURE* I knew an algorithm once, long ago, for calculating the SD "on the fly", ... Basically, I have a largish data set that it will only be convenient to read once and I'd like to be able to calculate [or at least estimate] the SD as the data comes by and avoid having to scroll it into a temp file and then do a second pass over the data from the temp file... Is there such an algorithm or am I just misremembering? I am probably misunderstandinng what you mean, but if you keep the running sum of the elements, and a running sum of the squares, then you can get the SD at any time, in one pass with the following (well known, simple) algorithm: (I tried to use a form of pseudo-code that isn't partial to any particular "real" language) NewDatum(stream, datum) stream.count = stream.count + 1; stream.sum = stream.sum + datum; stream.squareSum = stream.squareSum + datum^2; end StandardDeviation(stream) return(sqrt(stream.squareSum/stream.count - (stream.sum/stream.count)^2)); end
This is mostly correct, except the computation for standard deviation is StandardDeviation(stream) return(sqrt( (stream.squareSum - (stream.sum)^2/stream.count) / (stream.count - 1) )) When computing standard deviation, you divide by n if you know the mean a-priori, and you divide by n-1 if you are calculating the mean from the data. J.P. Grossman On Thu, 10 Jul 2003, Michael B Greenwald wrote:
Thu, 10 Jul 2003 11:20:52 -0400 "Bernie Cosell" <bernie@fantasyfarm.com>
I thought *SURE* I knew an algorithm once, long ago, for calculating the SD "on the fly", ... Basically, I have a largish data set that it will only be convenient to read once and I'd like to be able to calculate [or at least estimate] the SD as the data comes by and avoid having to scroll it into a temp file and then do a second pass over the data from the temp file... Is there such an algorithm or am I just misremembering?
I am probably misunderstandinng what you mean, but if you keep the running sum of the elements, and a running sum of the squares, then you can get the SD at any time, in one pass with the following (well known, simple) algorithm: (I tried to use a form of pseudo-code that isn't partial to any particular "real" language)
NewDatum(stream, datum) stream.count = stream.count + 1; stream.sum = stream.sum + datum; stream.squareSum = stream.squareSum + datum^2; end
StandardDeviation(stream) return(sqrt(stream.squareSum/stream.count - (stream.sum/stream.count)^2)); end
_______________________________________________ math-fun mailing list math-fun@mailman.xmission.com http://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
At 11:36 AM -0400 7/10/03, JP Grossman wrote:
This is mostly correct, except the computation for standard deviation is
StandardDeviation(stream) return(sqrt( (stream.squareSum - (stream.sum)^2/stream.count) / (stream.count - 1) ))
When computing standard deviation, you divide by n if you know the mean a-priori, and you divide by n-1 if you are calculating the mean from the data.
Dividing by n-1 (and not taking the square root) gives you an unbiased estimate of the variance. I'm not familiar with any formulas for unbiased estimate of the standard deviation (this would be an interesting exercise!). Keep in mind that unbiased estimators are not necessarily superior to other estimators. There is a nice demonstration of this in the March 2003 American Math Monthly: "An Illuminating Counterexample" by Michael Hardy, p. 234-238. Paul
At 11:20 AM 7/10/2003, Bernie Cosell wrote:
I thought *SURE* I knew an algorithm once, long ago, for calculating the SD "on the fly",
temp file... Is there such an algorithm or am I just misremembering?
Yes, it is called the calculating formula. Let sumX be the sum of the n data points and sumX2 be the sum of their squares. Then sqrt( (n*sumX2 - (sumX)^2) / (n*(n-1)) ) is the S.D.
There's an archaic trick that people used for keeping running tallies of all the information needed for SD in the days of mechanical calculators. If a measurement is, say, 239.87, you enter something like 10000239.87, square it, then add to the previous sum. You end up with a long number that has N (the sample size) followed by a bunch of 0's, followed by twice the sum of the data, followed by a bunch of 0's, followed by the sum of the squares of the data. I've forgotten the capabilities of the mechanical calculators, but I think when you wanted the actual SD there was a way to get it without re-entering the data --- probably doing square roots by iterating divide and average. There's a related trick for doing the running tallies needed for correlations. I did a bunch of this as an undergraduate, as well as factor analysis, on a mechanical calculator, around 1965 or so. It's a lot easier now! Bill On Thursday, July 10, 2003, at 11:20 AM, Bernie Cosell wrote:
I thought *SURE* I knew an algorithm once, long ago, for calculating the SD "on the fly", but I can't remember any details of any such trickery and a quick Google search only came up with algorithms that I already know and that start with the obvious "calculate the mean..." Basically, I have a largish data set that it will only be convenient to read once and I'd like to be able to calculate [or at least estimate] the SD as the data comes by and avoid having to scroll it into a temp file and then do a second pass over the data from the temp file... Is there such an algorithm or am I just misremembering?
Thanks!
/Bernie\
-- Bernie Cosell Fantasy Farm Fibers mailto:bernie@fantasyfarm.com Pearisburg, VA --> Too many people, too few sheep <--
_______________________________________________ math-fun mailing list math-fun@mailman.xmission.com http://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
Bernie Cosell, All you need to do is compute on the fly the sum of squares S = x[1]^2 + ... + x[n]^2 of the empirical values, as well as the sum of values T = x[1] + ... + x[n]. The mean M = T/n, and the variance V = ((x[1] - M)^2 + (x[n] - M)^2)/(n-1). The SD is the square root of V. If you expand the terms in the equation for V, you get V = (S -2*M*T + n*M^2)/(n-1) = (S - n*M^2)/(n-1), which you can compute in one pass. On Thursday, Jul 10, 2003, at 09:20 America/Denver, Bernie Cosell wrote:
I thought *SURE* I knew an algorithm once, long ago, for calculating the SD "on the fly", but I can't remember any details of any such trickery and a quick Google search only came up with algorithms that I already know and that start with the obvious "calculate the mean..." Basically, I have a largish data set that it will only be convenient to read once and I'd like to be able to calculate [or at least estimate] the SD as the data comes by and avoid having to scroll it into a temp file and then do a second pass over the data from the temp file... Is there such an algorithm or am I just misremembering?
Thanks!
/Bernie\
-- Bernie Cosell Fantasy Farm Fibers mailto:bernie@fantasyfarm.com Pearisburg, VA --> Too many people, too few sheep <--
_______________________________________________ math-fun mailing list math-fun@mailman.xmission.com http://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
Quoting Roland Silver <rollos@starband.net>:
Bernie Cosell, All you need to do is compute on the fly the sum of squares S = x[1]^2 + ... + x[n]^2 of the empirical values, as well as the sum of values T = x[1] + ... + x[n]. The mean M = T/n, and the variance V = ((x[1] - M)^2 + (x[n] - M)^2)/(n-1). The SD is the square root of V. If you expand the terms in the equation for V, you get V = (S -2*M*T + n*M^2)/(n-1) = (S - n*M^2)/(n-1), which you can compute in one pass.
On Thursday, Jul 10, 2003, at 09:20 America/Denver, Bernie Cosell wrote:
... [an inquiry] ...
I don't know why people insist on putting that (n-1) in the denominator of the variance formula. Both mean and variance (as well as second moment} are >averages< of certain quantities, which means dividing by the number of data points, which is n. At first sight, you think you can't accumulate a variance since it is the second moment about the mean, which you don't know yet. But one of Newton's identities saves the day, because the total variance satisfies a quadratic identity with the total mean and the total second moment, both of which can be accumulated term by term, which is more or less the formula cited in these messsages, but with n and not (n-1). To check: the variance of >one< datum is >zero<. That >(n-1)< is a Classical, Canonical source of confusion amongst statisticians, who often don't bother with mathematics, and enshrined in the folklore by a publication of the National Bureau of Standards from some years ago when one turned to the govenrment for help on these mysteries. The reason for the confusion is: The Variance of the Mean is NOT the Mean of the Variance, and if the two quantities are both computed and then compared, the roles of (n) and (n-1) are readily apparent. This is why talking of "estimators" does not exactly answer the question first proposed. - hvm ------------------------------------------------- Obtén tu correo en www.correo.unam.mx UNAMonos Comunicándonos
participants (8)
-
Bernie Cosell -
JP Grossman -
Jud McCranie -
mcintosh@servidor.unam.mx -
Michael B Greenwald -
Paul R. Pudaite -
Roland Silver -
William Thurston