views: 1240

answers: 6

For large n (see below for how to determine what's large enough), it's safe, by the central limit theorem, to treat the distribution of the sample mean as normal (Gaussian), but I'd like a procedure that gives a confidence interval for any n. The way to do that is to use a Student t distribution with n-1 degrees of freedom.

So the question is: given a stream of data points that you collect or encounter one at a time, how do you compute a c (e.g., c = 0.95) confidence interval on the mean of the data points (without storing all of the previously encountered data)?

Another way to ask this is: How do you keep track of the first and second moments for a stream of data without storing the whole stream?

BONUS QUESTION: Can you keep track of higher moments without storing the whole stream?

A: 

I think you can. I'd have to Google/Wikipedia for it, so I'll leave that as an exercise for the reader.

BCS
+4  A: 

Here's an article on how to compute the mean and standard deviation in a single pass, without storing any data. Once you have these two statistics, you can estimate a confidence interval. A 95% confidence interval would be [mean - 1.96*stdev, mean + 1.96*stdev], assuming a normal distribution for your data and a large number of data points.

For a smaller number of data points, your confidence interval would be [mean - c(n)*stdev, mean + c(n)*stdev] where c(n) depends on your sample size and your confidence level. For a 95% confidence level, here are your values of c(n) for n = 2, 3, 4, ..., 30:

12.70620, 4.302653, 3.182446, 2.776445, 2.570582, 2.446912, 2.364624, 2.306004, 2.262157, 2.228139, 2.200985, 2.178813, 2.160369, 2.144787, 2.131450, 2.119905, 2.109816, 2.100922, 2.093024, 2.085963, 2.079614, 2.073873, 2.068658, 2.063899, 2.059539, 2.055529, 2.051831, 2.048407, 2.045230

These numbers are the magnitudes of g(0.025, n-1), where g is the inverse CDF of the t distribution with n-1 degrees of freedom. If you wanted a 99% confidence interval, replace 0.025 with 0.005. In general, for a confidence level of 1-alpha, use alpha/2.

Here's the R command that generated the constants above.

n = seq(2, 30); -qt(0.025, n-1)
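
If you're working in C rather than R, GSL gives the same constants; a minimal sketch (gsl_cdf_tdist_Qinv returns the upper-tail quantile, so no sign flip is needed):

#include <stdio.h>
#include <gsl/gsl_cdf.h>

/* Print c(n) for n = 2..30 at the 95% confidence level,
   i.e. the upper 0.025 quantile of the t distribution with n-1 df. */
int main(void)
{
    for (int n = 2; n <= 30; n++)
        printf("c(%d) = %.6f\n", n, gsl_cdf_tdist_Qinv(0.025, n - 1));
    return 0;
}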

Here's a blog post explaining why the numbers above are not as close to 1.96 as you might expect.

John D. Cook
Note that I don't want to assume a normal distribution. That's an excellent article, but I think it's only applicable to the large n case.
dreeves
I'm just now appreciating the brilliance of the technique in that article, and it *is* relevant here, just that there's another step needed, namely, applying the t distribution as David Nehme indicates. Maybe it's best to update the question...
dreeves
I added notes for how to incorporate the t distribution for small samples. This will give you a general solution for all n.
John D. Cook
I think you mean the standard error of the sample mean, not stdev. I made this more explicit in my answer as well.
dreeves
+2  A: 
With n the number of data points seen so far, s their running sum, and q the running sum of their squares:

   sigma = sqrt( (q - (s*s/n)) / (n-1) )
   delta = t(1-c/2,n-1) * sigma / sqrt(n)

where t(x, n-1) is the t distribution with n-1 degrees of freedom. If you are using GSL,

t = gsl_cdf_tdist_Qinv (c/2.0, n-1)

There's no need to store any data beyond the sum of squares. Now, you might have a numerical issue because the sum of squares can be quite large. You could use the alternate definition of sigma,

sigma = sqrt( sum( (x_i - s/n)^2 ) / (n-1) )

and make two passes. I would encourage you to consider using the GNU Scientific Library or a package like R to help you avoid numerical issues. Also, be careful about your use of the central limit theorem. Abuse of it is partially to blame for the whole financial crisis going on right now.
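
To make this concrete, here's a minimal C sketch of the sum-of-squares approach (hypothetical names; it takes alpha = 1-c to sidestep the c/2 versus (1-c)/2 question raised in the comments below):

#include <math.h>
#include <gsl/gsl_cdf.h>

/* Streaming state: count n, running sum s = sum(x_i), running sum of squares q = sum(x_i^2). */
typedef struct { double n, s, q; } stream_t;

void stream_add(stream_t *st, double x) { st->n += 1; st->s += x; st->q += x * x; }

/* Half-width of the (1-alpha) confidence interval for the mean,
   e.g. alpha = 0.05 for a 95% interval. */
double stream_delta(const stream_t *st, double alpha)
{
    double n = st->n;
    double sigma = sqrt((st->q - st->s * st->s / n) / (n - 1));  /* sample standard deviation */
    double t = gsl_cdf_tdist_Qinv(alpha / 2.0, n - 1);           /* upper-tail t quantile */
    return t * sigma / sqrt(n);
}

The interval is then s/n plus or minus stream_delta(&st, alpha). (But note the caveat above and in the other answers: accumulating q this way can lose precision for large n.)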

David Nehme
PS: John D. Cook has provided a better way to keep track of sigma (the sample standard deviation). So I think the right way to structure this for posterity is to remove my sum-of-squares presumption from the question and combine yours and John's answers.
dreeves
Done. Also, I think you want the inverse cdf, right? I wrote up my answer in terms of the Euler beta function.
dreeves
Do you mean t((1-c)/2,n-1)? And then t(p,df) = gsl_cdf_tdist_Qinv(p,df)?
dreeves
+1  A: 

You don't want to accumulate the sum-of-squares. The resulting statistics are numerically inaccurate -- you'll end up subtracting two large, similar numbers. You want to maintain the variance, or (n-1)*variance, or something like that.

The straightforward way is to accumulate the datapoints incrementally. The formula is not complicated or hard to derive (see John D. Cook's link).

An even more accurate way to do it is to combine the datapoints pairwise-recursively. You can do this with memory logarithmic in n: register k holds statistics for 2^k older datapoints, which are combined with statistics for 2^k newer points to get statistics for 2^(k+1) points...
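
To make the combining step concrete, here's a minimal C sketch of merging two (count, mean, M2) summaries, where M2 is the sum of squared deviations from the mean, i.e. (n-1) times the sample variance (this is the standard pairwise-update formula, with hypothetical names):

/* Summary of a batch of data points. */
typedef struct { double n, mean, m2; } stats_t;

/* Combine the summaries of two disjoint batches. */
stats_t stats_merge(stats_t a, stats_t b)
{
    stats_t out;
    double delta = b.mean - a.mean;
    out.n    = a.n + b.n;
    out.mean = a.mean + delta * b.n / out.n;
    out.m2   = a.m2 + b.m2 + delta * delta * a.n * b.n / out.n;
    return out;
}

The logarithmic-memory scheme then keeps one such summary per power of two and merges equal-sized summaries as they fill, much like carries in binary addition.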

comingstorm
I don't understand your "even more accurate way". Y'all have definitely convinced me about not storing the sum of squares though. I'll update the question. If you can say more about the "pairwise-recursive" technique, I'm all ears.
dreeves
Adding small numbers to a large accumulator value loses precision, which can be important, say if n is large and accuracy requirements are strong. If you're worried about small-n confidence intervals, a regular accumulator should be fine -- the big step is moving away from the sum of squares thing.
comingstorm
+2  A: 

[Huge thanks to John D. Cook for much of what I learned in putting together this answer!]

First, here's the reason not to use sum-of-squares: http://www.johndcook.com/blog/2008/09/26/

What you should do instead:

Keep track of the count (n), the mean (u), and a quantity (s) from which sample variance and standard error can be determined. (Adapted from http://www.johndcook.com/standard_deviation.html.)

Initialize n = u = s = 0.

For each new datapoint, x:

u0 = u;                   // save the previous mean
n++;
u += (x - u) / n;         // update the running mean
s += (x - u0) * (x - u);  // update the running sum of squared deviations

The sample variance is then s/(n-1), the variance of the sample mean is s/(n-1)/n, and the standard error of the sample mean is SE = sqrt(s/(n-1)/n).

It remains to compute the Student-t c-confidence interval (c in (0,1)):

u [plus or minus] SE*g((1-c)/2, n-1)

where g is the inverse CDF (aka quantile function) of the Student-t distribution (location 0, scale 1), taking a probability and the degrees of freedom (one less than the number of data points):

g(p,df) = sign(2*p-1)*sqrt(df)*sqrt(1/irib(1, -abs(2*p-1), df/2, 1/2) - 1)

where irib is the inverse regularized incomplete beta function:

irib(s0,s1,a,b) = z such that rib(s0,z,a,b) = s1

where rib is the regularized incomplete beta function:

rib(x0,x1,a,b) = B(x0,x1,a,b) / B(a,b)

where B(a,b) is the Euler beta function and B(x0,x1,a,b) is the incomplete beta function:

B(a,b) = Gamma(a)*Gamma(b)/Gamma(a+b) = integral_0^1 t^(a-1)*(1-t)^(b-1) dt
B(x0,x1,a,b) = integral_x0^x1 t^(a-1)*(1-t)^(b-1) dt

Typical numerical/statistics libraries will have implementations of the beta function (or the inverse CDF of the Student-t distribution directly). For C, the de facto standard is the GNU Scientific Library (GSL). Often a 3-argument version of the beta function is given; the generalization to 4 arguments is as follows:

B(x0,x1,a,b) = B(x1,a,b) - B(x0,a,b)
rib(x0,x1,a,b) = rib(x1,a,b) - rib(x0,a,b)
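
For example, with GSL you can skip the beta-function route and call the t inverse CDF directly. Here's a minimal C sketch of the whole procedure (hypothetical names; gsl_cdf_tdist_Pinv plays the role of g):

#include <math.h>
#include <gsl/gsl_cdf.h>

/* Streaming state: count n, running mean u, and s = sum of squared deviations. */
typedef struct { double n, u, s; } running_t;

/* The update step described above. */
void running_add(running_t *r, double x)
{
    double u0 = r->u;
    r->n += 1;
    r->u += (x - r->u) / r->n;
    r->s += (x - u0) * (x - r->u);
}

/* c-confidence interval for the mean, e.g. c = 0.95. */
void running_mci(const running_t *r, double c, double *lo, double *hi)
{
    double se = sqrt(r->s / (r->n - 1) / r->n);                  /* standard error of the mean */
    double d  = se * gsl_cdf_tdist_Pinv((1 - c) / 2, r->n - 1);  /* SE * g((1-c)/2, n-1), negative */
    *lo = r->u + d;
    *hi = r->u - d;
}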

Finally, here is an implementation in Mathematica:

(* Take current {n,u,s} and new data point; return new {n,u,s}. *)
update[{n_,u_,s_}, x_] := {n+1, u+(x-u)/(n+1), s+(x-u)(x-(u+(x-u)/(n+1)))}

Needs["HypothesisTesting`"];
g[p_, df_] := InverseCDF[StudentTDistribution[df], p]

(* Mean CI given n,u,s and confidence level c. *)
mci[n_,u_,s_, c_:.95] := With[{d = Sqrt[s/(n-1)/n]*g[(1-c)/2, n-1]}, 
  {u+d, u-d}]

Compare to

StudentTCI[u, SE, n-1, ConfidenceLevel->c]

or, when the entire list of data points is available,

MeanCI[list, ConfidenceLevel->c]

Finally, if you don't want to load math libraries for things like the beta function, you can hardcode a lookup table for -g((1-c)/2, n-1). Here it is for c=.95 and n=2..100:

12.706204736174698, 4.302652729749464, 3.182446305283708, 2.7764451051977934, 2.570581835636314, 2.4469118511449666, 2.3646242515927853, 2.306004135204168, 2.262157162798205, 2.2281388519862735, 2.2009851600916384, 2.178812829667226, 2.1603686564627917, 2.1447866879178012, 2.131449545559774, 2.1199052992212533, 2.1098155778333156, 2.100922040241039, 2.093024054408307, 2.0859634472658626, 2.0796138447276835, 2.073873067904019, 2.0686576104190477, 2.0638985616280254, 2.0595385527532963, 2.05552943864287, 2.051830516480281, 2.048407141795243, 2.0452296421327034, 2.042272456301236, 2.039513446396408, 2.0369333434600976, 2.0345152974493392, 2.032244509317719, 2.030107928250338, 2.0280940009804462, 2.0261924630291066, 2.024394163911966, 2.022690920036762, 2.0210753903062715, 2.0195409704413745, 2.018081702818439, 2.016692199227822, 2.0153675744437627, 2.0141033888808457, 2.0128955989194246, 2.011740513729764, 2.0106347576242314, 2.0095752371292335, 2.0085591121007527, 2.007583770315835, 2.0066468050616857, 2.005745995317864, 2.0048792881880577, 2.004044783289136, 2.0032407188478696, 2.002465459291016, 2.001717484145232, 2.000995378088259, 2.0002978220142578, 1.9996235849949402, 1.998971517033376, 1.9983405425207483, 1.997729654317692, 1.9971379083920013, 1.9965644189523084, 1.996008354025304, 1.9954689314298386, 1.994945415107228, 1.9944371117711894, 1.9939433678456229, 1.993463566661884, 1.9929971258898527, 1.9925434951809258, 1.992102154002232, 1.9916726096446793, 1.9912543953883763, 1.9908470688116922, 1.9904502102301198, 1.990063421254452, 1.989686323456895, 1.9893185571365664, 1.9889597801751728, 1.9886096669757192, 1.9882679074772156, 1.9879342062390228, 1.9876082815890748, 1.9872898648311672, 1.9869786995062702, 1.986674540703777, 1.986377154418625, 1.9860863169510985, 1.9858018143458114, 1.9855234418666061, 1.9852510035054973, 1.9849843115224508, 1.9847231860139618, 1.98446745450849, 1.9842169515863888

which asymptotically approaches the corresponding quantile of a normal(0,1) distribution; for c=.95 that limit is:

-sqrt(2)*InverseErf(-c) = 1.959963984540054235524594430520551527955550...

See http://mathworld.wolfram.com/InverseErf.html for the inverse erf() function. Notice that -g((1-.95)/2, n-1) doesn't round to 1.96 until there are at least 474 data points. It rounds to 2.0 when there are 29 data points.

As a rule of thumb, you should use Student-t instead of the normal approximation for n up to at least 300, not 30 per conventional wisdom. Cf. http://www.johndcook.com/blog/2008/11/12/.

See also "Improving Compressed Counting" by Ping Li of Cornell.

dreeves
+1  A: 

I don't think you have to worry much about the size of n, since it will soon exceed 30, where the distribution can be considered normal. If you don't want to store any data points from previous samples, I think the best way is to use Bayesian recursion to make posterior inference on the population mean and variance parameters, assuming a normal model. You can take a look at this document for joint inference for the mean and variance, specifically equations 38a, 38b, and 38c.
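
To illustrate what such a recursion can look like, here's a minimal C sketch using the standard normal-inverse-gamma conjugate prior (the parameterization and the hyperparameter names mu, kappa, alpha, beta are illustrative and may differ from equations 38a-c in the linked document; the prior values also need to be chosen sensibly):

#include <math.h>
#include <gsl/gsl_cdf.h>

/* Normal-inverse-gamma posterior parameters for (mean, variance). */
typedef struct { double mu, kappa, alpha, beta; } nig_t;

/* Absorb one data point x into the posterior (conjugate update). */
void nig_add(nig_t *p, double x)
{
    double k1 = p->kappa + 1.0;
    p->beta  += p->kappa * (x - p->mu) * (x - p->mu) / (2.0 * k1);
    p->mu     = (p->kappa * p->mu + x) / k1;
    p->kappa  = k1;
    p->alpha += 0.5;
}

/* Central c-credible interval for the mean: the marginal posterior of the mean
   is a Student t with 2*alpha degrees of freedom, location mu, and scale
   sqrt(beta / (alpha * kappa)). */
void nig_mean_interval(const nig_t *p, double c, double *lo, double *hi)
{
    double scale = sqrt(p->beta / (p->alpha * p->kappa));
    double t = gsl_cdf_tdist_Qinv((1.0 - c) / 2.0, 2.0 * p->alpha);
    *lo = p->mu - t * scale;
    *hi = p->mu + t * scale;
}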

gd047
Thanks! As for the n=30 threshold, see the second-to-last paragraph in my answer (the one I immodestly marked as accepted).
dreeves