Mean value and standard deviation of a very large data set

Alfred Zhong · Apr 28, 2012 · Viewed 31.2k times

I am wondering if there is an algorithm that calculates the mean value and standard deviation of an unbounded data set.

For example, I am monitoring a measured value, say, electric current, and I would like to have the mean of all historical data. Whenever a new value comes in, the mean and stdev should be updated. Because the data set is too big to store, I hope it can update the mean and stdev on the fly without storing the data.

Even if the data were stored, the standard way, (d1 + ... + dn)/n, doesn't work: the sum would overflow the data representation.

I thought about computing d1/n + d2/n + ... + dn/n, but if n is huge the error is too big and accumulates. Besides, n is unbounded in this case.

The number of data points is definitely unbounded; whenever a new one arrives, the values need to be updated.

Does anybody know if there is an algorithm for it?

Answer

andrew cooke · Apr 28, 2012

[did the question change? maybe i only read the start. i have updated/edited to give a better reply:]

there is no perfect solution (in constant memory) that i know of, but i can give various approaches.

first, for the basic calculation you only need the sum of all values (sum_x), the sum of their squares (sum_x2), and the total count (n). then:

mean = sum_x / n
stdev = sqrt( sum_x2/n - mean^2 )

and all these values (sum_x, sum_x2, n) can be updated from a stream.
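
for illustration, a minimal sketch of that accumulator in python (the language is my choice, not part of the question):

```python
import math

class StreamingStats:
    """Naive streaming mean/stdev: the whole state is sum_x, sum_x2 and n."""
    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_x2 = 0.0

    def add(self, x):
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x

    def mean(self):
        return self.sum_x / self.n

    def stdev(self):
        m = self.mean()
        # clamp at zero to guard against tiny negative values from rounding
        return math.sqrt(max(self.sum_x2 / self.n - m * m, 0.0))

stats = StreamingStats()
for x in (4.8, 5.1, 5.0, 4.9):   # made-up current readings
    stats.add(x)
print(stats.mean(), stats.stdev())
```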

the problem (as you say) is dealing with overflow and / or limited precision. you can see this with floating point: once sum_x2 grows so large that its internal representation can no longer register a value the size of a single squared sample, adding further values silently loses precision.

a simple way to avoid the problem is to use exact arithmetic, but that will be increasingly slow (and also uses O(log(n)) memory).
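
for example, python's arbitrary-precision integers and fractions.Fraction give exact sums out of the box; a small sketch of that route (the sample readings are made up):

```python
from fractions import Fraction
from math import sqrt

n = 0
sum_x = Fraction(0)
sum_x2 = Fraction(0)

# made-up current readings expressed as exact rationals (4.8, 5.1, 5.0)
for x in (Fraction(48, 10), Fraction(51, 10), Fraction(50, 10)):
    n += 1
    sum_x += x
    sum_x2 += x * x

mean = sum_x / n                      # exact rational mean
variance = sum_x2 / n - mean * mean   # exact rational variance, no cancellation
print(float(mean), sqrt(variance))    # convert to float only at the very end
```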

another way that can help is to normalise values - if you know that values are typically X then you can do calculations on x - X which makes the sums smaller (obviously you then add X to the mean!). that helps postpone the point at which precision is lost (and can/should be combined with any other method here - when binning, for example, you can use the mean of the previous bin). see this algorithm (knuth's method) for how to do this progressively.
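
a sketch of that progressive method (welford's / knuth's online algorithm) - it updates the running mean and the sum of squared deviations from it, which keeps the accumulated numbers much smaller than raw sums of squares:

```python
import math

class RunningStats:
    """Welford/Knuth online algorithm: numerically stabler than raw sums."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0    # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stdev(self):
        return math.sqrt(self.m2 / self.n) if self.n > 0 else 0.0

rs = RunningStats()
for x in (4.8, 5.1, 5.0, 4.9):
    rs.add(x)
print(rs.mean, rs.stdev())
```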

if you don't mind a (small constant factor) O(n) memory cost you could restart every N values (eg a million - smarter still would be to adapt this value by detecting when precision is getting too low), storing the previous mean and stdev and then combining them for the final result (so your mean is the appropriately weighted combination of the recent running total and the old binned values).
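
for the combining step, each finished bin only needs its count, mean and stdev; a sketch of how such summaries might be pooled (the bin summaries themselves could come from any of the methods above):

```python
import math

def combine_bins(bins):
    """Pool per-bin (count, mean, stdev) summaries into an overall mean/stdev."""
    total_n = sum(n for n, _, _ in bins)
    mean = sum(n * m for n, m, _ in bins) / total_n
    # each bin's E[x^2] is stdev^2 + mean^2; pool those, then subtract the overall mean^2
    ex2 = sum(n * (s * s + m * m) for n, m, s in bins) / total_n
    return mean, math.sqrt(max(ex2 - mean * mean, 0.0))

# e.g. two finished bins of a thousand values each (numbers are made up)
print(combine_bins([(1000, 5.02, 0.11), (1000, 4.98, 0.09)]))
```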

the binning approach could probably be generalised to multiple levels (you start binning the bins) and would reduce to O(log(n)) memory use, but i haven't worked out the details.

finally, a more practical solution would probably be to do the initial approach for, say, 1000 values, and then start a new sum in parallel. you could display a weighted average of the two and, after another 1000 values, drop the old sums (after gradually decreasing their weight) and start a new set. so you always have two sets of sums, and display a weighted average between them, which gives continuous data that reflect the last 1000 values (only). in some cases that will be good enough, i imagine (it's not an exact value, since it's only for "recent" data, but it's smooth and representative, and uses a fixed amount of memory).
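
a rough sketch of that crossfading idea (the window size and the linear weighting are arbitrary illustrative choices, not a standard algorithm):

```python
import math

class CrossfadingStats:
    """Two overlapping accumulators; report a weighted blend of their stats."""
    def __init__(self, window=1000):
        self.window = window
        self.old = None            # frozen (n, sum_x, sum_x2) of the previous window
        self.new = [0, 0.0, 0.0]   # current (n, sum_x, sum_x2)

    def add(self, x):
        self.new[0] += 1
        self.new[1] += x
        self.new[2] += x * x
        if self.new[0] == self.window:
            self.old = tuple(self.new)   # freeze the filled accumulator ...
            self.new = [0, 0.0, 0.0]     # ... and start a fresh one in parallel

    def stats(self):
        w = self.new[0] / self.window    # weight of the new accumulator grows as it fills
        n = sum_x = sum_x2 = 0.0
        for weight, acc in ((1.0 - w, self.old), (w, tuple(self.new))):
            if acc is not None and acc[0] > 0:
                n += weight * acc[0]
                sum_x += weight * acc[1]
                sum_x2 += weight * acc[2]
        if n == 0:
            return 0.0, 0.0
        mean = sum_x / n
        return mean, math.sqrt(max(sum_x2 / n - mean * mean, 0.0))
```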

ps, something that occurred to me later - really, doing this "forever" doesn't make much sense anyway, because you'll get to a point where the values are absolutely dominated by the old data. it would be better to use a "running mean" which gives decreased weight to old values. see for example https://en.wikipedia.org/wiki/Moving_average - however, i don't know of a common equivalent to stdev.
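
a minimal sketch of such a running mean (an exponentially weighted moving average; alpha is a forgetting factor you would tune to your data rate):

```python
class ExpMovingMean:
    """Exponentially weighted moving average: old values fade out gradually."""
    def __init__(self, alpha=0.01):
        self.alpha = alpha    # larger alpha -> faster forgetting of old data
        self.mean = None

    def add(self, x):
        if self.mean is None:
            self.mean = x                           # seed with the first value
        else:
            self.mean += self.alpha * (x - self.mean)
        return self.mean
```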