I have a large set of numbers, probably in the multiple-gigabytes range. The first issue is that I can't store all of them in memory. The second is that any attempt at adding them up will result in an overflow. I was thinking of using more of a rolling average, but it needs to be accurate. Any ideas?

These are all floating point numbers.

This is not read from a database; it is a CSV file collected from multiple sources. It has to be accurate, as it is stored as parts of a second (e.g. 0.293482888929), and a rolling average can be the difference between .2 and .3.

It is a set of numbers representing how long users took to respond to certain form actions. For example, when showing a message box, how long did it take them to press OK or Cancel? The data was sent to me stored as seconds.portions of a second (1.2347 seconds, for example). Converting it to milliseconds, I overflow int, long, etc. rather quickly. Even if I don't convert it, I still overflow rather quickly. I guess the one answer below is correct: maybe I don't have to be 100% accurate, just look within a certain range inside of a specific StdDev, and I would be close enough.

+12  A: 

Integers or floats?

If they're integers, you need to accumulate a frequency distribution by reading the numbers and recording how many of each value you see. That can be averaged easily.

For floating point, this is a bit of a problem. Given the overall range of the floats, and the actual distribution, you have to work out a bin-size that preserves the accuracy you want without preserving all of the numbers.


Edit

First, you need to sample your data to get a mean and a standard deviation. A few thousand points should be good enough.

Then, you need to determine a respectable range. Folks pick things like ±6σ (standard deviations) around the mean. You'll divide this range into as many buckets as you can stand.

In effect, the number of buckets determines the number of significant digits in your average. So, pick 10,000 or 100,000 buckets to get 4 or 5 digits of precision. Since it's a measurement, odds are good that your measurements only have two or three digits.
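A minimal sketch of this bucketing approach (not from the original answer; the file name, one-value-per-line layout, and the sampled mean and standard deviation are assumptions for illustration):

using System;
using System.IO;

class BucketedMean
{
    static void Main()
    {
        double mean = 1.5, stdDev = 0.8;          // estimated from the initial sample
        double lo = mean - 6 * stdDev;
        double hi = mean + 6 * stdDev;
        const int BucketCount = 100000;           // ~5 significant digits
        double width = (hi - lo) / BucketCount;

        long[] counts = new long[BucketCount];
        long total = 0, outliers = 0;

        foreach (string line in File.ReadLines("data.csv"))
        {
            double v = double.Parse(line);
            int b = (int)((v - lo) / width);
            if (b < 0 || b >= BucketCount) { outliers++; continue; }
            counts[b]++;
            total++;
        }

        // Approximate mean: weight each bucket's midpoint by its count.
        double sum = 0;
        for (int b = 0; b < BucketCount; b++)
            sum += counts[b] * (lo + (b + 0.5) * width);

        Console.WriteLine("Approximate mean: {0} ({1} outliers ignored)",
                          sum / total, outliers);
    }
}

Each value is off by at most half a bucket width, so the number of buckets directly controls how many digits you keep.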


Edit

What you'll discover is that the mean of your initial sample is very close to the mean of any other sample. And any sample mean is close to the population mean. You'll note that most (but not all) of your means are within 1 standard deviation of each other.

You should find that your measurement errors and inaccuracies are larger than your standard deviation.

This means that a sample mean is as useful as a population mean.

S.Lott
Sorry, forgot to include that. This is scientific data so it is floats.
A: 

Depending on the range of numbers, it might be a good idea to have an array where the subscript is your number and the value is the count of that number; you could then do your calculation from this.
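A tiny sketch of that idea (illustrative only; it assumes the values can be mapped to non-negative integers below some known maximum, e.g. whole milliseconds):

using System.Collections.Generic;

static class CountingAverage
{
    public static double Average(IEnumerable<int> values, int maxValue)
    {
        long[] counts = new long[maxValue];   // counts[v] = how many times v occurred
        foreach (int v in values)
            counts[v]++;

        double weightedSum = 0;
        long total = 0;
        for (int v = 0; v < maxValue; v++)
        {
            weightedSum += (double)v * counts[v];
            total += counts[v];
        }
        return weightedSum / total;
    }
}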

KM
+16  A: 

You can sample randomly from your set ("population") to get an average ("mean"). The accuracy will be determined by how much your samples vary (as determined by "standard deviation" or variance).

The advantage is that you have billions of observations, and you only have to sample a fraction of them to get a decent accuracy or the "confidence range" of your choice. If the conditions are right, this cuts down the amount of work you will be doing.

Here's a numerical library for C# that includes a random sequence generator. Just make a random sequence of numbers that reference indices in your array of elements (from 1 to x, the number of elements in your array). Dereference to get the values, and then calculate your mean and standard deviation.
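A rough sketch of the sampling computation, using the built-in System.Random rather than the library linked above; the file name, one-value-per-line layout, and the 1% sampling rate are assumptions for illustration:

using System;
using System.IO;

class SampleMean
{
    static void Main()
    {
        var rng = new Random();
        double sum = 0, sumSq = 0;
        long n = 0;

        foreach (string line in File.ReadLines("data.csv"))
        {
            if (rng.NextDouble() >= 0.01) continue;   // keep roughly 1% of the values
            double v = double.Parse(line);
            sum += v;
            sumSq += v * v;
            n++;
        }

        double mean = sum / n;
        double variance = (sumSq - n * mean * mean) / (n - 1);
        Console.WriteLine("mean ~ {0}, std dev ~ {1}, from {2} samples",
                          mean, Math.Sqrt(variance), n);
    }
}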

If you want to test the distribution of your data, consider using the Chi-Squared Fit test or the K-S test, which you'll find in many spreadsheet and statistical packages (e.g., R). That will help confirm whether this approach is usable or not.

Alex Reynolds
Assuming he can judge what confidence range is required, this is good advice. +1
Lucas Lindström
Uniform sampling also assumes that the observations are distributed uniformly. If he has a billion integers and most of them are very small, but a few outliers at the edges are very large, the accuracy of the mean will go down. So it's not perfect, but requires many, many fewer calculations if the conditions are right.
Alex Reynolds
+1, political pollsters only interview a 1/100,000th of the population and still usually get it right!
Daniel Earwicker
Whether this works totally depends on the distribution. (Not all distributions are Gaussian and standard deviation may mean nothing with this data set.)
tom10
This is entirely correct and worth pointing out. If the data do not fit, this is not the right approach.
Alex Reynolds
You usually get better results by taking the median of the means of several samples.
Dave
@ Alex Reynolds -- The sampling method proposed DOES NOT assume that the observations are distributed uniformly. It assumes that the observations are iid (independently identically distributed). The CLT applies to ALL distributions with finite variance. This distribution almost certainly has finite variance so if you are willing to accept the iid bit, the CLT and the subsampling approach are entirely appropriate.
leif
+1 leif, variance is all that matters here.
Dave
+7  A: 

Wouldn't a rolling average be as accurate as anything else (discounting rounding errors, I mean)? It might be kind of slow because of all the dividing.

You could group batches of numbers and average them recursively. Like average 100 numbers 100 times, then average the result. This would be less thrashing and mostly addition.

In fact, if you added 256 or 512 at once you could divide the result exactly by 2^8 or 2^9 (I believe you could do this in a double by simply adjusting the floating-point exponent); this would make your program extremely quick, and it could be written recursively in just a few lines of code (not counting the unsafe operation of tweaking the exponent).

Perhaps dividing by 256 would already use this optimization? I may have to speed test dividing by 255 vs 256 and see if there is some massive improvement. I'm guessing not.
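A small sketch of the power-of-two batch idea; Math.ScaleB is assumed to be available (newer .NET runtimes only), and sum / 256.0 gives the same result on any version:

// Dividing a double by 256 only changes its exponent, so the "shift" is exact.
static double AverageBatchOf256(double[] batch)    // assumes batch.Length == 256
{
    double sum = 0;
    foreach (double d in batch)
        sum += d;
    return Math.ScaleB(sum, -8);                    // exponent -= 8, i.e. sum / 256
}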

Bill K
Another benefit of dividing the work up into batches: individual batches could easily run in separate threads for increased performance (assuming you are able to create new batches fast enough to make this worthwhile).
Joel Coehoorn
True, but you have to be careful to ensure that each batch is the same size or retains its count, because the final merge (with different counts) is going to involve some slightly tricky ratio math.
Bill K
A: 

If the numbers are ints, accumulate the total in a long. If the numbers are longs... what language are you using? In Java you could accumulate the total in a BigInteger, which is an integer that will grow as large as it needs to be. You could always write your own class to reproduce this functionality. The gist of it is just to make an array of integers to hold each "big number". When you add two numbers, loop through starting with the low-order value. If the result of the addition sets the high-order bit, clear that bit and carry the one to the next column.
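A rough sketch of that hand-rolled big-number addition, using uint "columns" in base 2^32 (System.Numerics.BigInteger already does this for you; this is only to show the carry loop):

using System;

static class BigNumber
{
    // a and b store the low-order column first.
    public static uint[] Add(uint[] a, uint[] b)
    {
        int len = Math.Max(a.Length, b.Length);
        uint[] result = new uint[len + 1];
        ulong carry = 0;
        for (int i = 0; i < len; i++)
        {
            ulong column = carry
                         + (i < a.Length ? a[i] : 0u)
                         + (i < b.Length ? b[i] : 0u);
            result[i] = (uint)column;    // low 32 bits stay in this column
            carry = column >> 32;        // anything above carries to the next one
        }
        result[len] = (uint)carry;
        return result;
    }
}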

Another option would be to find the average of, say, 1000 numbers at a time. Hold these intermediate results, then when you're done average them all together.

Jay
+4  A: 

You could break the data into sets of, say, 1000 numbers, average these, and then average the averages.

tom10
This is a good first step. Dividing the set size by one thousand, or even one billion, may not be sufficient. In order to clear this hurdle, you should set it up as a tree; what you described is effectively a tree with only one layer. Second, the set might not be evenly divisible or (gasp) might have a prime number of elements. You can clear this hurdle using a weighted average.
Kennet Belenky
OP says he has multiple gigabytes, so if he has 4 GB, he has about 1 billion sample points. Sets of 1000 leave 1 million averages. That is easy enough to store in memory, and then you do, say, another pass of sets of 1000. It's fast and easy, even with 40 GB. With 1 million averages to average again, the final one not being weighted exactly correctly seems fairly inconsequential, or you could just leave that step off altogether. But sure, one could weight it if they wanted to.
tom10
A: 

Why is a sum of floating point numbers overflowing? In order for that to happen, you would need to have values near the max float value, which sounds odd.

If you were dealing with integers I'd suggest using a BigInteger, or breaking the set into multiple subsets, recursively averaging the subsets, then averaging the averages.

If you're dealing with floats, it gets a bit weird. A rolling average could become very inaccurate. I suggest using a rolling average which is only updated when you hit an overflow exception or the end of the set. So effectively dividing the set into non-overflowing sets.

Strilanc
A: 

Two ideas from me:

  • If the numbers are ints, use an arbitrary precision library like IntX - this could be too slow, though
  • If the numbers are floats and you know the total count, you can divide each entry by that count and add up the results (see the sketch below). If you use double, the precision should be sufficient.
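A minimal sketch of the second idea, assuming the total count is known up front and the values sit one per line in a CSV-style file (both assumptions for illustration):

using System;
using System.IO;

class DivideThenSum
{
    static void Main()
    {
        long count = 1000000000;          // total number of values, known in advance
        double mean = 0;
        foreach (string line in File.ReadLines("data.csv"))
            mean += double.Parse(line) / count;   // each term is tiny, so no overflow
        Console.WriteLine("Average: {0}", mean);
    }
}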
Michael Borgwardt
+1  A: 

Here's one way to do it in pseudocode:

average=first
count=1
while more:
  count+=1
  diff=next-average
  average+=diff/count
return average
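The same incremental mean as a runnable C# sketch (the IEnumerable<double> parameter is just a stand-in for however the values are streamed in):

using System.Collections.Generic;

static class RunningMean
{
    public static double Of(IEnumerable<double> values)
    {
        double average = 0;
        long count = 0;
        foreach (double v in values)
        {
            count++;
            average += (v - average) / count;   // pull the mean toward v by 1/count
        }
        return average;
    }
}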
lostlogic
This isn't a great idea. Dividing by large numbers will introduce unnecessarily large rounding errors.
Kennet Belenky
+6  A: 

You're overflowing 32-bit and 64-bit numbers. But why not just use a proper rational bignum library? If you have so much data and you want an exact mean, then just code it.

using System;
using System.IO;

// RationalBignum and Bignum stand in for whatever exact-rational/bignum
// types you use or write yourself.
class RationalBignum {
    public Bignum Numerator { get; set; }
    public Bignum Denominator { get; set; }
}

class BigMeanr {
    public static int Main(string[] argv) {
        var sum = new RationalBignum(0);
        var n = new Bignum(0);
        using (var s = new FileStream(argv[0], FileMode.Open)) {
            using (var r = new BinaryReader(s)) {
                try {
                    while (true) {
                        var flt = r.ReadSingle();
                        var rat = new RationalBignum(flt);
                        sum += rat;
                        n++;
                    }
                }
                catch (EndOfStreamException) {
                    // Reached the end of the file; fall through and report the mean.
                }
            }
        }
        Console.WriteLine("The mean is: {0}", sum / n);
        return 0;
    }
}

Just remember, there are more numeric types out there than the ones your compiler offers you.

Frank Krueger
+2  A: 
Joel Coehoorn
While I like the running average idea, I don't like this implementation. I think it would be best to gather a sum for, say, 100 items, then add those to the running average.
Jean-Bernard Pellerin
Good point. The divide and conquer approach is far better here, and I upvoted one of the answers that suggested it. Not only does it avoid the C/C+1 precision problem, but it could easily be done in parallel if required to improve performance as well.
Joel Coehoorn
My math ticks off here - you have C, which you say "goes towards infinity", or at least a really big number. Then: C/(C+1) goes towards 1. A/(C+1) goes towards 0. V/(C+1) goes towards 0. All in all: A1 = 1 * 0 + 0. So, put shortly, A1 goes towards 0 - seems a bit off.
kastermester
The set is very large, but it is finite. C is very large and grows over time, but it's not infinite. I do address the precision problem in the last paragraph.
Joel Coehoorn
It is finite, yes, but so are our computers - I think with the above stated formula, if indeed dealing with gigabytes of data, you will end up running into these limitations and, well, be limited by them. Imagine row number 1 million: A1 = 1,000,000/1,000,001 * A/1,000,001 + V/1,000,001. The first bit here alone will eventually evaluate to 0.9999999 (continue till end of precision) and never change, regardless of the fact that it should go closer to 1 every single time. Depending on the values we're dealing with, the rest will go rather close to 0 as well.
kastermester
People voted for this? I left it up (with comments) as a bad example :(
Joel Coehoorn
+3  A: 

This is a classic divide-and-conquer type problem.

The key insight is that the average of a large set of numbers is the same as the average of the first half of the set, averaged with the average of the second half of the set (weighting each by its count, as the code below does, if the halves are unequal).

In other words:

AVG(A[1..N]) == AVG( AVG(A[1..N/2]), AVG(A[N/2..N]) )

Here is a simple recursive C# solution. It has passed my tests and should be completely correct.

public struct SubAverage
{
    public float Average;
    public int   Count;
};

static SubAverage AverageMegaList(List<float> aList)
{
    if (aList.Count <= 500) // Brute-force average 500 numbers or less.
    {
        SubAverage avg;
        avg.Average = 0;
        avg.Count   = aList.Count;
        foreach(float f in aList)
        {
            avg.Average += f;
        }
        avg.Average /= avg.Count;
        return avg;
    }

    // For more than 500 numbers, break the list into two sub-lists.
    SubAverage subAvg_A = AverageMegaList(aList.GetRange(0, aList.Count/2));
    SubAverage subAvg_B = AverageMegaList(aList.GetRange(aList.Count/2, aList.Count-aList.Count/2));

    SubAverage finalAnswer;
    finalAnswer.Average = subAvg_A.Average * subAvg_A.Count/aList.Count + 
                          subAvg_B.Average * subAvg_B.Count/aList.Count;
    finalAnswer.Count = aList.Count;

    Console.WriteLine("The average of {0} numbers is {1}",
        finalAnswer.Count, finalAnswer.Average);
    return finalAnswer;
}
abelenky
A: 

Why not just scale the numbers (down) before computing the average?

Esteban Araya
A: 

Sorry for the late comment, but isn't the formula above, provided by Joel Coehoorn, rewritten wrongly?

I mean, the basic formula is right:

Given:

A = current average
C = count of items
V = next value in the sequence

The next average (A1) is:

A1 = ( (C * A) + V ) / ( C + 1 )

But instead of:

A1 = C/(C+1) * A/(C+1) + V/(C+1)

shouldn't we have:

A1 = C/(C+1) * A + V/(C+1)

That would explain kastermester's post:

"My math ticks off here - You have C, which you say "go towards infinity" or at least, a really big number, then: C/(C+1) goes towards 1. A /(C+1) goes towards 0. V/(C+1) goes towards 0. All in all: A1 = 1 * 0 + 0 So put shortly A1 goes towards 0 - seems a bit off. – kastermester"

Because we would have A1 = 1 * A + 0, i.e., A1 goes towards A, which is right.
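As a quick check with small numbers: take C = 3, A = 2 and V = 6. The exact formula gives (3 * 2 + 6) / 4 = 3; the corrected form gives 3/4 * 2 + 6/4 = 3 as well, while the miswritten version gives 3/4 * 2/4 + 6/4 = 1.875.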

I've been using this method for calculating averages for a long time, and the aforementioned precision problems have never been an issue for me.