views:

109

answers:

5

suppose I want to calculate average value of a data-set such as

class Averager {
   float total;
   size_t count;
   float addData (float value) {
       this->total += value;
       return this->total / ++this->count;
   }
}

sooner or later the total or count value will overflow, so I make it doesn't remember the total value by :

class Averager {
   float currentAverage;
   size_t count;
   float addData (float value) {
       this->currentAverage = (this->currentAverage*count + value) / ++count;
       return this->currentAverage;
   }
}

it seems they will overflow longer, but the multiplication between average and count lead to overflow problem, so next solution is:

class Averager {
   float currentAverage;
   size_t count;
   float addData (float value) {
       this->currentAverage += (value - this->currentAverage) / ++count;
       return this->currentAverage;
   }
}

seems better, next problem is how to prevent count from overflow?

+2  A: 

Use double total; unsigned long long count;. You should still worry about accuracy, but it will be much less of a problem than with float.

Marcelo Cantos
I need statistic approach to this problem, no matter how large the data type of `counter` sooner or later it will overflow
uray
Or a big int :)
Peter Alexander
@uray: Are you serious? A 64-bit counter will take 70 years to wrap around even if you increment it 4 billion times a second. Are you expecting a 70-year uptime for your program? Are you expecting to have to average more than 18 quintillion numbers?
Marcelo Cantos
@uray and Marcelo: I see no possible solution then. Eventually the virtual memory will be exhausted or the universe will collapse. ;-)
Peter G.
@marcelo:the problem is as count grow, this part `(value - this->currentAverage) / ++count` will become zero and no calculation required, so larger data type is useless.
uray
@uray: Marcelo pointed that out to you as a comment to your question, and you said it doesn't matter. I think you need to need to nail down what problem you're trying to solve.
Dennis Zickefoose
@dennis:yea, based on my first comment, marcelo answer is correct by increasing counter data type, but my point is I'm looking solution without changing the datatype, used counter datatype is float (as my code example) or less would be better.
uray
+5  A: 

Aggregated buckets.

We pick a bucket size that's comfortably less than squareRoot(MAXINT). To keep it simple, let's pick 10.

Each new value is added to the current bucket, and the moving average can be computed as you describe.

When the bucket is full start a new bucket, remembering the average of the full bucket. We can safely calculate the overall average by combining the averages of the full buckets and the current, partial bucket. When we get to 10 full buckets, we create a bigger bucket, capacity 100.

To compute the total average we first compute the average of the "10s" and then combine that with the "100s". This pattern repeats for "1,000s" "10,000s" and so on. At each stage we only need to consider two levels one 10 x bigger than the previous one.

djna
I think what you describe is the "Moving Average" (http://en.wikipedia.org/wiki/Moving_average)
DR
@DR I think it's a bit more than that - classic moving average needs to maintain a count of how many items we have seen, and eventually this can overflow. The technique I describe avoids that problem.
djna
Ah, ok, you are right.
DR
A: 

You could use these special datatypes where integeres can grow infinitely until your RAM is full.

Henri
+1  A: 

What about using Arbitrary-precision arithmetic ?

There's a list of libraries you could use on Wikipedia: http://en.wikipedia.org/wiki/Bignum#Libraries

Most of Arbitrary-precision arithmetic libraries will not overflow until the number of digits stored fill the available memory (which is quite unlikely).

SirDarius
A: 

You want to use kahan's summation algorithm:

http://en.wikipedia.org/wiki/Kahan_summation_algorithm

See also the section about errors in summation in "What Every Computer Scientist Should Know About Floating-Point Arithmetic"

http://docs.sun.com/source/806-3568/ncg_goldberg.html#1262

Kahan's algorithm reduces the precision lost, but won't solve the overflow problem.
KennyTM