Assume we're frequently sampling a particular value and want to keep statistics on the samples. The simplest approach is to store every sample so we can calculate whatever stats we want, but this requires unbounded storage. Using a constant amount of storage, we can keep track of some stats like minimum and maximum values. What else can we track using only constant storage? I am thinking of percentiles, standard deviation, and any other useful statistics.

That's the theoretical question. In my actual situation, the samples are simply millisecond timings: profiling information for a long-running application. There will be millions of samples but not much more than a billion or so. So what stats can be kept for the samples using no more than, say, 10 variables?

+4  A: 

Minimum, maximum, average, total count, variance are all easy and useful. That's 5 values. Usually you'll store sum and not average, and when you need the average you can just divide the sum by the count.

So, in your loop

maxVal = max(x, maxVal);       // running maximum
minVal = min(x, minVal);       // running minimum
count += 1;                    // number of samples seen
sum += x;                      // running sum
secondorder += x*x;            // running sum of squares

Later, you may print any of these stats. Mean and standard deviation can be computed at any time and are:

mean = sum/count;
std = sqrt(secondorder/count - mean*mean);   // sqrt(E[x^2] - E[x]^2)

Median and percentile estimation are more difficult, but possible. The usual trick is to create a set of histogram bins and increment a bin's count whenever a sample falls inside it. You can then estimate the median and other percentiles by looking at the distribution of the bin populations. This is only an approximation of the true distribution, but it's often good enough. To find the exact median, you must store all samples.
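A minimal sketch of that histogram idea, assuming the sample range is known up front (the struct name, range, and bin count here are only illustrative):

#include <vector>
#include <cstddef>

// Fixed-size histogram over a known sample range; memory stays constant
// (numBins counters) no matter how many samples arrive.
struct Histogram {
    double lo, hi;
    std::vector<long long> bins;
    long long total = 0;

    Histogram(double lo, double hi, std::size_t numBins)
        : lo(lo), hi(hi), bins(numBins, 0) {}

    void add(double x) {
        std::size_t i = 0;
        if (x >= hi)     i = bins.size() - 1;
        else if (x > lo) i = static_cast<std::size_t>((x - lo) / (hi - lo) * bins.size());
        bins[i]++;
        total++;
    }

    // Approximate value below which a fraction p (0..1) of the samples fall:
    // walk the bins until the cumulative count reaches p * total, then return
    // that bin's lower edge.
    double percentile(double p) const {
        long long target = static_cast<long long>(p * total);
        long long seen = 0;
        for (std::size_t i = 0; i < bins.size(); ++i) {
            seen += bins[i];
            if (seen >= target)
                return lo + (hi - lo) * i / bins.size();
        }
        return hi;
    }
};

The bin count trades memory for resolution: more bins give finer percentile estimates but still cost only a fixed number of counters.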

SPWorley
Technically, total count requires O(n) storage, not constant storage, as does the sum.
John Feminella
yeah, but the original question makes it clear there's a bound on the number of samples, so that will be _O(1)_.
Charlie Martin
O(n)? No. There is one count, one sum, one min, one max. These don't depend on the number of samples. These are O(1) in space.
dmckee
Rather than variance, store the sum of squares (because you don't know the mean yet...). Likewise for higher moments if desired.
dmckee
Ah...I see you got that while I was typing. Good catch.
dmckee
And one year later I finally had time to review and accept this answer. How did I forget that the variance is equal to the mean of the squares minus the square of the mean? http://en.wikipedia.org/wiki/Variance#Properties
Bennett McElwee
A: 
Charlie Martin
+1  A: 

Here is an article that answers your question: Accurately computing running variance.

Beware that there are several ways to compute variance or standard deviation that are equivalent in exact arithmetic but not in computers using finite precision. Some algorithms that work well in theory are inaccurate in practice. The article above computes a running variance, i.e. constant memory, and does so accurately.
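For reference, here is a sketch of the kind of numerically stable one-pass update such articles typically describe (Welford's online algorithm). It avoids the cancellation that the raw sum-of-squares formula can suffer when the variance is small relative to the mean; the struct name is just illustrative.

#include <cmath>

// Welford's online algorithm: one pass, constant memory, numerically stable
// even when the variance is tiny compared to the mean.
struct RunningStats {
    long long n = 0;
    double mean = 0.0;
    double m2 = 0.0;    // sum of squared deviations from the current mean

    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);   // note: uses the updated mean
    }

    double variance() const { return n > 1 ? m2 / n : 0.0; }   // population variance; use m2/(n-1) for the sample variance
    double stddev()   const { return std::sqrt(variance()); }
};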

To compute exact percentiles, your memory requirements grow with the data set size: the exact 5th and 95th percentiles, for example, require storage proportional to the number of samples. But if you just want, say, the 5th or 95th smallest elements, then you can do that in constant memory. See the CodeProject article Calculating percentiles in memory-bound applications.
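A sketch of that fixed-k idea: to track the k smallest timings seen so far, keep a max-heap of size k and replace its largest element whenever a smaller sample arrives, so memory depends only on k, not on the number of samples. The class name is just illustrative.

#include <queue>
#include <cstddef>

// Track the k smallest samples seen so far with a size-k max-heap.
// Memory is O(k), independent of how many samples are processed.
class KSmallest {
    std::size_t k;
    std::priority_queue<double> heap;   // max-heap: largest of the kept samples on top
public:
    explicit KSmallest(std::size_t k) : k(k) {}

    void add(double x) {
        if (heap.size() < k) {
            heap.push(x);
        } else if (x < heap.top()) {
            heap.pop();      // evict the current k-th smallest
            heap.push(x);    // keep the new, smaller sample
        }
    }

    // The k-th smallest sample seen so far (valid once at least k samples have arrived).
    double kthSmallest() const { return heap.top(); }
};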

John D. Cook
Excellent link. If I had TAOCP on my desk I would have looked it up.
Bennett McElwee