ansaurus

Question

What is a good solution for calculating an average where the sum of all values exceeds a double's limits?

Answer 1

+6 A:

You could take the average of averages of equal-sized subsets of numbers that don't exceed the limit.

David M 2009-12-18 20:21:02

A much better example of this is below: http://stackoverflow.com/questions/1930454/what-is-a-good-solution-for-calculating-an-average-where-the-sum-of-all-values-ex/1930913#1930913

Davide 2009-12-18 23:44:08

Answer 2

+7 A:

Divide all values by the set size and then sum them up.
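A minimal Java sketch of this idea, assuming the values already sit in a `double[]`. The class and method names are made up for illustration and are not from the answer.

```java
final class DivideThenSum {
    // Mean computed by dividing every value by the count before adding it.
    static double mean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double n = values.length;
        double mean = 0.0;
        for (double v : values) {
            mean += v / n; // each term has magnitude at most |v| / n
        }
        return mean;
    }
}
```

The price is one division per element, and very small values can be rounded away by the division (for example, `Double.MIN_VALUE / 2` evaluates to `0.0`).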

Alon 2009-12-18 20:21:08

you might hit underflows, though, instead of overflow

Davide 2009-12-18 20:24:27

This involves many more divisions than are necessary though.

Will Bickford 2009-12-18 20:25:51

Interesting idea, but not a good one!

Paul Creasey 2009-12-18 20:39:14

If dividing by the number of elements in the total input will give "underflow", as in "loss of precision", then I'd venture a guess that *double* isn't a good enough data type for this problem.

Lasse V. Karlsen 2009-12-18 21:04:26

@davide: Java's double range is 4.94065645841246544e-324d to 1.79769313486231570e+308d; if the numbers are small enough to cause an underflow, there would have been no risk of an overflow. If there is a combination of large and small numbers, it's recommended to use the split-sets approach described in the other good answers, but it will make the function a bit less readable than the simplistic approach.

Alon 2009-12-18 21:10:54

Right direction, but this naïve approach will lead to loss of precision. You'd better divide large values by some power of two, and leave small values as they are. See my solution.

Adrian 2009-12-19 21:42:53

Answer 3

+12 A:

Use associativity. Average subsets of the set, then compute the average of the averages.

The average of {1, 2, 3, 4, 5, 6} can be computed as

(1 + 2 + 3 + 4 + 5 + 6) / 6

Or it can be computed as (for example)

((1 + 2 + 3) / 3 + (4 + 5 + 6) / 3) / 2
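As a rough Java sketch of the equal-sized-subsets version (names are hypothetical, and it assumes the number of values is an exact multiple of the chunk size, which is the case the comments below argue about):

```java
final class ChunkedMean {
    // Mean of equally sized chunk means; values.length must be a multiple of chunkSize.
    static double mean(double[] values, int chunkSize) {
        if (values.length == 0 || chunkSize <= 0 || values.length % chunkSize != 0) {
            throw new IllegalArgumentException("length must be a positive multiple of chunkSize");
        }
        int chunks = values.length / chunkSize;
        double sumOfChunkMeans = 0.0;
        for (int c = 0; c < chunks; c++) {
            double chunkSum = 0.0;
            for (int i = c * chunkSize; i < (c + 1) * chunkSize; i++) {
                chunkSum += values[i];
            }
            sumOfChunkMeans += chunkSum / chunkSize; // mean of this chunk
        }
        return sumOfChunkMeans / chunks; // mean of the chunk means
    }
}
```

With many chunks of huge values the sum of chunk means can itself overflow, so the same step may have to be applied again to the chunk means.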

Nate C-K 2009-12-18 20:21:39

are there any concerns about precision of this method given the potentially large number of divisions?

Simon 2009-12-18 20:36:01

Yes, I would think there are concerns about the precision. That will also be an issue if the numbers in the list have a considerable variation in their magnitude. The second method I just posted here is probably better in this respect since it minimizes the amount that the exponent is scaled up and down during the process.

Nate C-K 2009-12-18 20:42:16

What about the sequence 1, 2, 3, 4, 5, 6, 7?

Lasse V. Karlsen 2009-12-18 20:44:39

What about it? Are you referring to the fact that you need to account for sets with an odd number of elements?

Nate C-K 2009-12-18 20:48:36

Did you check that last piece of code? The first value in the list would have really little impact on the result, but the last item would account for roughly half of it. Why is this answer being upvoted?

Lasse V. Karlsen 2009-12-18 20:49:04

Are you sure that your edited approach works? Let's say you have values (1,2,3), then that would be ((1+2)/2+3)/2 which is 2.25...

Peter Lang 2009-12-18 20:49:20

This approach only works if a) You know the number of values beforehand, and b) you can divide the numbers into equally sized sets. If you can't, this won't work. He doesn't mention any of that though. The code-piece there is just wrong, but the approach is sound assuming you can divide the numbers up into sets.

Lasse V. Karlsen 2009-12-18 20:51:51

He did not specify that it is an ordered sequence. He said it's a set of numbers. If you know something in advance about the characteristics of the set then you can optimize the handling, otherwise plowing through it in order is a reasonable approach. If you're concerned about maximizing faithfulness to the original precision then you might want something like what Davide proposes -- but sorting a data set this size has a major performance cost.

Nate C-K 2009-12-18 20:53:03

The second piece is a different solution to the problem. It is not supposed to be an implementation of the first solution.

Nate C-K 2009-12-18 20:54:37

But it is *not* a solution to the problem. For a simple N-numbered set, myList[0] will account for 1/(2^N) of the sum, and myList[N-1] will account 0.5 of the sum. That code solves *a* problem, but nothing related to the question.

Lasse V. Karlsen 2009-12-18 20:55:57

Yes, you're right, I see it now. Sorry for being slow on that. I'm not sure what I was thinking.

Nate C-K 2009-12-18 20:57:10

Your code, for a 3-value list consisting of 1, 2, and 3, is equivalent to: (((1 + 2) / 2) + 3) / 2. Are you saying that the average of 1, 2 and 3, is 2.25?

Lasse V. Karlsen 2009-12-18 20:58:26

Also: (1 + 2 + 3 + 4 + 5 + 6) / 6 can be broken up into 1/6 + 2/6 + 3/6 + 4/6 + 5/6 + 6/6. So you can just take each number in your list, divide it by the number of elements in the list, and add the result of each division to a running total.

Waleed Al-Balooshi 2009-12-18 21:01:38

But... if the *double* data type doesn't have enough precision to hold the sum, does it have enough precision to hold the sum of *values/count* accurately enough to be usable?

Lasse V. Karlsen 2009-12-18 22:06:21

Answer 4

+5 A:

Option 1 is to use an arbitrary-precision library so you don't have an upper-bound.

Other options (which lose precision) are to sum in groups rather than all at once, or to divide before summing.

Anon. 2009-12-18 20:22:43

Most of the other answers are assuming they know more about your problem than they do. These are the correct general-purpose answers. Without knowing more about your problem, including all your requirements, more about your specific data, etc, any optimization advice is worthless.

Merlyn Morgan-Graham 2009-12-18 21:36:17

Answer 5

+10 A:

Apart from using the better approaches already suggested, you can use BigDecimal to make your calculations. (Bear in mind it is immutable)
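Something like the following, as a sketch only. The class name and the choice of `MathContext.DECIMAL128` for the final division are my assumptions, not part of the answer.

```java
import java.math.BigDecimal;
import java.math.MathContext;

final class BigDecimalMean {
    // Sums the exact values of the doubles, then divides once at the end.
    static double mean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        BigDecimal sum = BigDecimal.ZERO;
        for (double v : values) {
            // BigDecimal is immutable: add() returns a new object, so reassign.
            // new BigDecimal(double) throws NumberFormatException for NaN or infinity.
            sum = sum.add(new BigDecimal(v));
        }
        // 34 significant digits is more than a double can hold anyway.
        return sum.divide(BigDecimal.valueOf(values.length), MathContext.DECIMAL128).doubleValue();
    }
}
```

Every `add` allocates a new object, which is where the slowness mentioned in the comments comes from.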

Bozho 2009-12-18 20:25:49

And many, many times slower.

Peter Lawrey 2009-12-18 21:04:07

Sure. But also easier and with less thinking. Which I discourage, but..

Bozho 2009-12-18 21:06:11

Don't make life more difficult for yourself unless absolutely necessary - if you need to deal with very large numbers or high precision and you can sacrifice time then using more complex numeric types is a good way to go.

cyborg 2009-12-18 21:07:48

Slower is a relative term. In the case of calculating the average of 10^9 values, with BigDecimal, slow is several minutes (maybe even 30)... If a faster algorithm is required, the BigDecimal approach would be great to verify the faster implementation.

Fedearne 2009-12-18 23:38:27

Bad idea, this will create 10^9 objects for nothing. Since all input numbers fit into double range, the mean will also fit into double range, thus a solution using doubles only is possible.

Adrian 2009-12-19 21:47:12

Of course. I did propose using the better approaches already suggested. (Some of those objects should be garbage-collected at some point.)

Bozho 2009-12-19 22:46:41

Answer 6

+10 A:

IMHO, the most robust way of solving your problem is

sort your set
split in groups of elements whose sum wouldn't overflow - since they are sorted, this is fast and easy
do the sum in each group - and divide by the group size
combine the group averages (possibly calling this same algorithm recursively) - be aware that if the groups are not equally sized, you'll have to weight them by their size
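A simplified Java sketch of these steps. It assumes a fixed group size instead of choosing groups "whose sum wouldn't overflow", and the names are hypothetical. The weighting by group size is what lets the last, shorter group take part.

```java
import java.util.Arrays;

final class WeightedGroupMean {
    // Sort a copy, average fixed-size groups, then weight each group mean
    // by the fraction of all values that the group contains.
    static double mean(double[] values, int groupSize) {
        if (values.length == 0 || groupSize <= 0) {
            throw new IllegalArgumentException("need at least one value and a positive group size");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double total = sorted.length;
        double mean = 0.0;
        for (int start = 0; start < sorted.length; start += groupSize) {
            int size = Math.min(groupSize, sorted.length - start); // last group may be shorter
            double groupSum = 0.0;
            for (int i = start; i < start + size; i++) {
                groupSum += sorted[i];
            }
            mean += (groupSum / size) * (size / total); // group mean times its weight
        }
        return mean;
    }
}
```

For the values 1 through 7 with a group size of 3, the group means are 2, 5 and 7 with weights 3/7, 3/7 and 1/7, which gives 4 up to rounding.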

One nice thing about this approach is that it scales nicely if you have a really large number of elements to sum - and a large number of processors/machines to use to do the math.

Davide 2009-12-18 20:30:50

Note that the size of the sets must be equal, ie. 3-value sets, or 15-value sets, or whatever-value sets, but you can't mix different value sets, or values in smaller sets will have a higher impact on the result than those in larger sets.

Lasse V. Karlsen 2009-12-18 20:54:31

This approach can also be made multi-threaded to make use of all the CPUs in your system.

Peter Lawrey 2009-12-18 21:03:35

You want to avoid sorting a large collection of items if you can. The average does not depend upon the order of items, so this is a lot of extra work. Scanning for the largest N elements would give you enough information to make informed group size choices I think.

Will Bickford 2009-12-18 21:09:07

@Lasse not necessarily. You just have to weight the sum appropriately (i.e., depending on the set size).

Davide 2009-12-18 21:17:55

@Will: in math, the sum does not depend on the order of items. In floating-point arithmetic, it does. The most robust way to solve the sum problem is indeed the one I wrote: sort and sum in chunks. It is not the fastest, but it's safe, and easily parallelized.

Davide 2009-12-18 21:24:29

in Java terms, 'sorting a set' sounds inappropriate ;)

Bozho 2009-12-18 21:37:31

Are you dividing in there somewhere? :-)

Ken 2009-12-18 21:42:11

I've picked up the gauntlet, please tell me where I'm being a fool: http://stackoverflow.com/questions/1931359/how-to-reduce-calculation-of-average-to-sub-sets-in-a-general-way

Lasse V. Karlsen 2009-12-18 23:55:41

And note that I'm going to have/find a reallllly good excuse for why I'm not seeing the solution right now, regardless of how obvious it is :) Like, it's 1 in the morning Sunday ... :) - Seriously though, I really can't see this solution and I do have present code that suffers a similar problem (not with the limitation to "double", but more about how I retrieve the values) where I need to calculate the average; any good solution to this which makes me slap my forehead would be a great benefit.

Lasse V. Karlsen 2009-12-18 23:58:32

Bad idea, sorting a large array is time consuming. You better group values into large and small ones as you iterate over them. See my solution.

Adrian 2009-12-19 21:46:04

Note that this, as well as most of the other answers, requires you to store all of the data in memory (to sort it, split it into groups, etc.), which may not be possible if we're talking 10^9 values (depends on the platform, obviously). My answer and martinus's do not require storing the data in memory.

Dan Tao 2009-12-19 23:46:38

If you want a faster sort using less memory, just sort on the exponent, not the mantissa.

Christopher Edwards 2009-12-19 23:53:25

@dan, mine neither.

Adrian 2009-12-20 17:35:30

Answer 7

+6 A:

A double can be divided by a power of 2 without loss of precision. So if your only problem is the absolute size of the sum, you could pre-scale your numbers before summing them. But with a dataset of this size, there is still the risk that you will hit a situation where you are adding small numbers to a large one, and the small numbers will end up being mostly (or completely) ignored.

For instance, when you add 2.2e-20 to 9.0e20 the result is 9.0e20, because once the scales are adjusted so that the numbers can be added together, the smaller number is 0. Doubles can only hold about 17 digits, and you would need more than 40 digits to add these two numbers together without loss.
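Both effects are easy to check. The helper below is only an illustration with made-up names: one method tests whether a small addend disappears, the other whether scaling by a power of two round-trips.

```java
final class AbsorptionDemo {
    // True when adding small to large leaves large unchanged.
    static boolean isAbsorbed(double large, double small) {
        return large + small == large;
    }

    // Dividing by a power of two only changes the exponent, so it round-trips
    // exactly unless the scaled value drops into the subnormal range.
    static boolean scalesExactly(double value, int power) {
        double factor = Math.pow(2.0, power);
        return (value / factor) * factor == value;
    }
}
```

`isAbsorbed(9.0e20, 2.2e-20)` is true, and so is `scalesExactly(0.1, 30)`; `scalesExactly(Double.MIN_VALUE, 1)` is false because there is no exponent left to give up.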

So, depending on your data set and how many digits of precision you can afford to lose, you may need to do other things. Breaking the data into sets will help, but a better way to preserve precision might be to determine a rough average (you may already know this number), then subtract the rough average from each value before you sum it. That way you are summing the distances from the average, so your sum should never get very large.

Then you take the average delta, and add it to your rough average to get the correct average. Keeping track of the min and max delta will also tell you how much precision you lost during the summing process. If you have lots of time and need a very accurate result, you can iterate.

John Knoeller 2009-12-18 21:08:53

To sum large and small values, one should use Kahan summation.

Adrian 2009-12-19 19:31:53

Answer 8

+7 A:

The very first issue I'd like to ask you is this:

Do you know the number of values beforehand?

If not, then you have little choice but to sum, and count, and divide, to do the average. If Double isn't high enough precision to handle this, then tough luck, you can't use Double, you need to find a data type that can handle it.

If, on the other hand, you do know the number of values beforehand, you can look at what you're really doing and change how you do it, but keep the overall result.

The average of N values, stored in some collection A, is this:

A[0]   A[1]   A[2]   A[3]          A[N-2]   A[N-1]
---- + ---- + ---- + ---- + .... + ------ + ------
 N      N      N      N               N        N

To calculate subsets of this result, you can split up the calculation into equally sized sets, so you can do this, for 3-valued sets (assuming the number of values is divisible by 3, otherwise you need a different divisor):

/ A[0]   A[1]   A[2] \   / A[3]   A[4]   A[5] \   //      A[N-2]   A[N-1] \
| ---- + ---- + ---- |   | ---- + ---- + ---- |   \\    + ------ + ------ |
\  3      3      3   /   \  3      3      3   /   //        3        3    /
 --------------------- +  --------------------  + \\      ----------------
          N                        N                        N
         ---                      ---                      ---
          3                        3                        3

Note that you need equally sized sets, otherwise numbers in the last set, which will not have enough values compared to all the sets before it, will have a higher impact on the final result.

Consider the numbers 1-7 in sequence, if you pick a set-size of 3, you'll get this result:

/ 1   2   3 \   / 4   5   6 \   / 7 \ 
| - + - + - | + | - + - + - | + | - |
\ 3   3   3 /   \ 3   3   3 /   \ 3 /
 -----------     -----------     ---
      y               y           y

which gives:

     2   5   7/3
     - + - + ---
     y   y    y

If y is 3 for all the sets, you get this:

     2   5   7/3
     - + - + ---
     3   3    3

which gives:

2*3   5*3    7
--- + --- + ---
 9     9     9

which is:

6   15   7
- + -- + -
9    9   9

which totals:

28
-- ~ 3.1111...
 9

The average of 1-7, is 4. Obviously this won't work. Note that if you do the above exercise with the numbers 1, 2, 3, 4, 5, 6, 7, 0, 0 (note the two zeroes at the end there), then you'll get the above result.

In other words, if you can't split the number of values up into equally sized sets, the last set will be counted as though it has the same number of values as all the sets preceding it, but it will be padded with zeroes for all the missing values.

So, you need equally sized sets. Tough luck if your original input set consists of a prime number of values.

What I'm worried about here though is loss of precision. I'm not entirely sure Double will give you good enough precision in such a case, if it initially cannot hold the entire sum of the values.

Lasse V. Karlsen 2009-12-18 21:13:04

You can trivially have not-equally-sized sets if you weight them appropriately.

Davide 2009-12-18 21:33:19

Please tell me how that works, I would love to learn how to do this correctly because I have a similar problem in a private project of mine and I have yet to find a good solution! For instance, tell me how to weigh the simple sequence of the values from 1 through 7, in such a way that I don't have to sum them all up together.

Lasse V. Karlsen 2009-12-18 21:58:54

... let me emphasize this. Please prove me wrong, I need this solution as well.

Lasse V. Karlsen 2009-12-18 21:59:55

my answer has the trivial case for non-equally sized sets.

Carl 2009-12-18 22:05:38

well, as does Davide's, though he wasn't explicit about how the weighting is done.

Carl 2009-12-18 22:07:23

@Lasse ask the detailed question on that famous site called stackoverflow, and you'll get several good answers :-) If you want **my** answer, be sure to link your question under one of my answers - I think it's trivial, so I'm not editing mine here per your request in a comment to yours

Davide 2009-12-18 23:42:44

Ok, I'll pick up that gauntlet.

Lasse V. Karlsen 2009-12-18 23:43:34

Ok, posted: http://stackoverflow.com/questions/1931359/how-to-reduce-calculation-of-average-to-sub-sets-in-a-general-way, please let me know if you think I've changed the problem in any significant way.

Lasse V. Karlsen 2009-12-18 23:54:49

Answer 9

A:

Check out the section for cumulative moving average

basszero 2009-12-18 21:20:37

That wouldn't solve the problem, as incrementing the average effectively requires recovering the (N-1)th sum, which here would still be large...

Mark E 2009-12-19 05:02:45

Answer 10

+3 A:

So I don't repeat myself so much, let me state that I am assuming that the list of numbers is normally distributed, and that you can sum many numbers before you overflow. The technique still works for non-normal distros, but some things will not meet the expectations I describe below.

--

Sum up a sub-series, keeping track of how many numbers you eat, until you approach the overflow, then take the average. This will give you an average a0, and count n0. Repeat until you exhaust the list. Now you should have many ai, ni.

Each ai and ni should be relatively close, with the possible exception of the last bite of the list. You can mitigate that by under-biting near the end of the list.

You can combine any subset of these ai, ni by picking any ni in the subset (call it np) and dividing all the ni in the subset by that value. The max size of the subsets to combine is the roughly constant value of the n's.

The ni/np should be close to one. Now sum ni/np * ai and multiply by np/(sum ni), keeping track of sum ni. This gives you a new ni, ai combination, if you need to repeat the procedure.

If you will need to repeat (i.e., the number of ai, ni pairs is much larger than the typical ni), try to keep relative n sizes constant by combining all the averages at one n level first, then combining at the next level, and so on.
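Here is how I would sketch the "bite until you approach overflow" idea in Java. This is a simplification under my own assumptions: the `limit` parameter and all names are hypothetical, every value is assumed to satisfy |value| <= limit, and the pairs are combined in one pass with weights ni / (sum ni) rather than via the intermediate ni/np scaling described above.

```java
import java.util.ArrayList;
import java.util.List;

final class BiteMean {
    // Sums values until the next one could push |sum| past limit, then stores
    // an {average, count} pair and starts a new bite; finally combines the pairs.
    static double mean(double[] values, double limit) {
        if (values.length == 0 || !(limit > 0.0) || Double.isInfinite(limit)) {
            throw new IllegalArgumentException("need values and a positive finite limit");
        }
        List<double[]> pairs = new ArrayList<>(); // each entry is {ai, ni}
        double sum = 0.0;
        long count = 0;
        for (double v : values) {
            // Written as |sum| > limit - |v| so the check itself cannot overflow.
            if (count > 0 && Math.abs(sum) > limit - Math.abs(v)) {
                pairs.add(new double[] {sum / count, count});
                sum = 0.0;
                count = 0;
            }
            sum += v;
            count++;
        }
        pairs.add(new double[] {sum / count, count}); // last bite is never empty here
        double total = values.length;
        double mean = 0.0;
        for (double[] pair : pairs) {
            mean += pair[0] * (pair[1] / total); // weight each average by its share
        }
        return mean;
    }
}
```

For the sequence 1-7 with a limit of 10, the bites are {1, 2, 3, 4}, {5}, {6} and {7}, so the pairs are (2.5, 4), (5, 1), (6, 1), (7, 1), and the weighted combination comes out at 4 up to rounding.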

Carl 2009-12-18 22:01:07

fyi, this is essentially Davide's answer, without the pre-sort. The pre-sort should reduce numerical error, but perhaps not at the level that matters relative to the expense of the sort.

Carl 2009-12-18 22:11:02

Can you give a specific example of how to handle the sequence 1-7? Your answer looks promising, but perhaps it's late, but I can't wrap my head around it fully :P

Lasse V. Karlsen 2009-12-18 22:32:52

Answer 11

+1 A:

A random sampling of a small set of the full dataset will often result in a 'good enough' solution. You obviously have to make this determination yourself based on system requirements. Sample size can be remarkably small and still obtain reasonably good answers. This can be adaptively computed by calculating the average of an increasing number of randomly chosen samples - the average will converge within some interval.
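A very small Java sketch of the sampling idea, assuming the data is indexable and sampling with replacement is acceptable. The names, the fixed sample size and the explicit seed are my assumptions; the adaptive "grow the sample until the average settles" part is left out.

```java
import java.util.Random;

final class SampledMean {
    // Estimates the mean from sampleSize values drawn uniformly at random, with replacement.
    static double estimate(double[] values, int sampleSize, long seed) {
        if (values.length == 0 || sampleSize <= 0) {
            throw new IllegalArgumentException("need values and a positive sample size");
        }
        Random rnd = new Random(seed);
        double avg = 0.0;
        for (int k = 1; k <= sampleSize; k++) {
            double x = values[rnd.nextInt(values.length)];
            avg += (x - avg) / k; // running mean of the sample drawn so far
        }
        return avg;
    }
}
```

The result is only an estimate; how large the sample must be depends on the spread of the data and on how much error you can tolerate.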

Sampling not only addresses the double overflow concern, but is much, much faster. Not applicable for all problems, but certainly useful for many problems.

Kevin Day 2009-12-19 04:50:26

Sampling might be useful in some contexts, but normality has nothing to do with it.

Rob Hyndman 2009-12-20 06:21:00

Rob - good point - thanks.

Kevin Day 2009-12-21 02:43:03

Answer 12

+2 A:

I posted an answer to a question spawned from this one, realizing afterwards that my answer is better suited to this question than to that one. I've reproduced it below. I notice though, that my answer is similar to a combination of Bozho's and Anon.'s.

As the other question was tagged language-agnostic, I chose C# for the code sample I've included. Its relative ease of use and easy-to-follow syntax, along with its inclusion of a couple of features facilitating this routine (a DivRem function in the BCL, and support for iterator functions), as well as my own familiarity with it, made it a good choice for this problem. Since the OP here is interested in a Java solution, but I'm not Java-fluent enough to write it effectively, it might be nice if someone could add a translation of this code to Java.

Some of the mathematical solutions here are very good. Here's a simple technical solution.

Use a larger data type. This breaks down into two possibilities:

Use a high-precision floating point library. One who encounters a need to average a billion numbers probably has the resources to purchase, or the brain power to write, a 128-bit (or longer) floating point library.

I understand the drawbacks here. It would certainly be slower than using intrinsic types. You still might over/underflow if the number of values grows too high. Yada yada.
If your values are integers or can be easily scaled to integers, keep your sum in a list of integers. When you overflow, simply add another integer. This is essentially a simplified implementation of the first option. A simple ~~(untested)~~ example in C# follows

using System;
using System.Collections.Generic;

class BigMeanSet{
    List<uint> list = new List<uint>();

    public double GetAverage(IEnumerable<uint> values){
        list.Clear();
        list.Add(0);

        uint count = 0;

        foreach(uint value in values){
            Add(0, value);
            count++;
        }

        return DivideBy(count);
    }

    void Add(int listIndex, uint value){
        if((list[listIndex] += value) < value){ // then overflow has occurred
            if(list.Count == listIndex + 1)
                list.Add(0);
            Add(listIndex + 1, 1);
        }
    }

    double DivideBy(uint count){
        const double shift = 4.0 * 1024 * 1024 * 1024;

        double rtn       = 0;
        long   remainder = 0;

        for(int i = list.Count - 1; i >= 0; i--){
            rtn *= shift;
            remainder <<= 32;
            rtn += Math.DivRem(remainder + list[i], count, out remainder);
        }

        rtn += remainder / (double)count;

        return rtn;
    }
}

Like I said, this is untested—I don't have a billion values I really want to average—so I've probably made a mistake or two, especially in the DivideBy function, but it should demonstrate the general idea.

This should provide as much accuracy as a double can represent and should work for any number of 32-bit elements, up to 2³² - 1. If more elements are needed, then the count variable will need to be expanded and the DivideBy function will increase in complexity, but I'll leave that as an exercise for the reader.

In terms of efficiency, it should be as fast or faster than any other technique here, as it only requires iterating through the list once, only performs one division operation (well, one set of them), and does most of its work with integers. I didn't optimize it, though, and I'm pretty certain it could be made slightly faster still if necessary. Ditching the recursive function call and list indexing would be a good start. Again, an exercise for the reader. The code is intended to be easy to understand.
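For Java readers, here is a rough equivalent rather than a line-by-line translation. It swaps the hand-rolled list of `uint` words for `java.math.BigInteger`, which plays the same role of an integer accumulator that grows as needed. The class name and the `MathContext.DECIMAL128` rounding for the final division are my assumptions.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;

final class ExactIntegerMean {
    // Accumulates an exact integer sum, then performs a single division at the end.
    static double mean(long[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        BigInteger sum = BigInteger.ZERO;
        for (long v : values) {
            sum = sum.add(BigInteger.valueOf(v)); // BigInteger is immutable, so reassign
        }
        return new BigDecimal(sum)
                .divide(BigDecimal.valueOf(values.length), MathContext.DECIMAL128)
                .doubleValue();
    }
}
```

This will be slower than the word-list version above because each addition allocates, but the sum itself never overflows.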

~~If anybody more motivated than I am at the moment feels like verifying the correctness of the code, and fixing whatever problems there might be, please be my guest.~~

I've now tested this code, and made a couple of small corrections (a missing pair of parentheses in the List<uint> constructor call, and an incorrect divisor in the final division of the DivideBy function).

I tested it by first running it through 1000 sets of random length (ranging between 1 and 1000) filled with random integers (ranging between 0 and 2³² - 1). These were sets for which I could easily and quickly verify accuracy by also running a canonical mean on them.

I then tested with 100* large series, with random length between 10⁵ and 10⁹. The lower and upper bounds of these series were also chosen at random, constrained so that the series would fit within the range of a 32-bit integer. For any series, the results are easily verifiable as (lowerbound + upperbound) / 2.

* Okay, that's a little white lie. I aborted the large-series test after about 20 or 30 successful runs. A series of length 10⁹ takes just under a minute and a half to run on my machine, so half an hour or so of testing this routine was enough for my tastes.

For those interested, my test code is below:

static IEnumerable<uint> GetSeries(uint lowerbound, uint upperbound){
    for(uint i = lowerbound; i <= upperbound; i++)
        yield return i;
}

static void Test(){
    Console.BufferHeight = 1200;
    Random rnd = new Random();

    for(int i = 0; i < 1000; i++){
        uint[] numbers = new uint[rnd.Next(1, 1000)];
        for(int j = 0; j < numbers.Length; j++)
            numbers[j] = (uint)rnd.Next();

        double sum = 0;
        foreach(uint n in numbers)
            sum += n;

        double avg = sum / numbers.Length;
        double ans = new BigMeanSet().GetAverage(numbers);

        Console.WriteLine("{0}: {1} - {2} = {3}", numbers.Length, avg, ans, avg - ans);

        if(avg != ans)
            Debugger.Break();
    }

    for(int i = 0; i < 100; i++){
        uint length     = (uint)rnd.Next(100000, 1000000001);
        uint lowerbound = (uint)rnd.Next(int.MaxValue - (int)length);
        uint upperbound = lowerbound + length;

        double avg = ((double)lowerbound + upperbound) / 2;
        double ans = new BigMeanSet().GetAverage(GetSeries(lowerbound, upperbound));

        Console.WriteLine("{0}: {1} - {2} = {3}", length, avg, ans, avg - ans);

        if(avg != ans)
            Debugger.Break();
    }
}

P Daddy 2009-12-19 17:02:54

Answer 13

+7 A:

Please clarify the potential ranges of the values.

Given that a double has a range ~= +/-10^308, and you're summing 10^9 values, the apparent range suggested in your question is values of the order of 10^299.

That seems somewhat, well, unlikely...

If your values really are that large, then with a normal double you've got only 17 significant decimal digits to play with, so you'll be throwing away about 280 digits worth of information before you can even think about averaging the values.

I would also note (since no-one else has) that for any set of numbers X:

mean(X) = sum(X[i] - c)  +  c
          -------------
                N

for any arbitrary constant c.

In this particular problem, setting c = min(X) might dramatically reduce the risk of overflow during the summation.
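A short Java sketch of that identity with c = min(X). The class name is hypothetical, and it assumes the spread max(X) - min(X) is small enough that the sum of offsets stays in range; with values of both signs near the limits, even `v - c` could overflow.

```java
final class ShiftedMean {
    // mean(X) = sum(X[i] - c) / N + c, with c chosen as the minimum value.
    static double mean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double c = values[0];
        for (double v : values) {
            c = Math.min(c, v);
        }
        double sumOfOffsets = 0.0;
        for (double v : values) {
            sumOfOffsets += v - c; // offsets are non-negative and often much smaller than v
        }
        return sumOfOffsets / values.length + c;
    }
}
```

This needs two passes over the data, one to find the minimum and one to sum the offsets.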

May I humbly suggest that the problem statement is incomplete...?

Alnitak 2009-12-19 17:18:54

Answer 14

A:

(n1+n2+...+nk) / k = (n1+n2) / k + (n3+n4) / k + ... + (n(k-1)+nk) / k, if k is even
(n1+n2+...+nk) / k = n1 / k + (n2+n3) / k + ... + (n(k-1)+nk) / k, if k is odd

D_K 2009-12-19 18:56:08

Answer 15

+2 A:

First of all, make yourself familiar with the internal representation of double values. Wikipedia should be a good starting point.

Then, consider that doubles are expressed as "value plus exponent" where exponent is a power of two. The limit of the largest double value is an upper limit of the exponent, and not a limit of the value! So you may divide all large input numbers by a large enough power of two. This should be safe for all large enough numbers. You can re-multiply the result with the factor to check whether you lost precision in the division.

Here we go with an algorithm

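public static double mean(double[] numbers) {
  double eachSum = 0.0; // sum of values that cannot be scaled down without loss
  double tempSum = 0.0; // sum of values after scaling down by factor
  double factor = Math.pow(2.0, 30); // about as large as 10^9
  for (double each : numbers) {
    double temp = each / factor;
    if (temp * factor != each) {
      // scaling lost precision (the value is tiny), so add it unscaled
      eachSum += each;
    } else {
      tempSum += temp;
    }
  }
  // scale the large part back up, then add the contribution of the small part
  return (tempSum / numbers.length) * factor + (eachSum / numbers.length);
}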

And don't be worried by the additional division and multiplication. The FPU will optimize the hell out of them since they are done with a power of two (for comparison, imagine adding and removing digits at the end of a decimal number).

PS: in addition, you may want to use Kahan summation to improve the precision. Kahan summation avoids loss of precision when very large and very small numbers are summed up.
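For reference, a minimal Kahan summation in Java (the class name is made up). Note that it addresses lost low-order digits, not overflow, so it would be combined with the scaling above rather than replace it.

```java
final class KahanSum {
    // Compensated summation: carries the rounding error of each addition forward.
    static double sum(double[] values) {
        double sum = 0.0;
        double compensation = 0.0; // running estimate of the low-order bits lost so far
        for (double v : values) {
            double y = v - compensation; // apply the correction to the next value
            double t = sum + y;          // low-order bits of y may be lost here
            compensation = (t - sum) - y; // recover what was lost (algebraically zero)
            sum = t;
        }
        return sum;
    }
}
```

With 1.0 followed by ten values of 1e-16, a plain loop returns exactly 1.0 because every small addend is rounded away, while the compensated version ends up slightly above 1.0.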

Adrian 2009-12-19 19:13:34

Answer 16

+8 A:

You can calculate the mean iteratively. This algorithm is simple, quite fast, and you have to process each value just once, and the variables never get larger than the largest value in the set, so you won't get an overflow.

double mean(double[] ary) {
  double avg = 0;
  int t = 1;
  for (double x : ary) {
    avg += (x - avg) / t;
    ++t;
  }
  return avg;
}

Inside the loop avg always is the average value of all values processed so far. In other words, if all the values are finite you should not get an overflow.
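The same update also works without holding the data in an array. The variant below is only a sketch with hypothetical names: it reads from a `PrimitiveIterator.OfDouble` (for example the iterator of a `DoubleStream`) and uses a `long` counter. One caveat I would add: `x - avg` can still overflow if the data contains values of opposite sign near the limits of the double range.

```java
import java.util.PrimitiveIterator;

final class StreamingMean {
    // Running mean over an iterator; returns NaN when there are no values.
    static double mean(PrimitiveIterator.OfDouble values) {
        double avg = 0.0;
        long t = 0;
        while (values.hasNext()) {
            double x = values.nextDouble();
            t++;
            avg += (x - avg) / t; // avg is now the mean of the first t values
        }
        return t == 0 ? Double.NaN : avg;
    }
}
```

A call could look like `StreamingMean.mean(java.util.stream.DoubleStream.of(1.0, 2.0, 3.0, 4.0).iterator())`.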

martinus 2009-12-19 22:08:15

I think this is a solid solution to the OP's question. Nice.

Kevin Day 2009-12-21 02:47:22

If you have a large number of values to average (which is the only case in which you would have the problem that the sum overflows a double), then this algorithm will have severe underflow issues. Essentially, at some point, (x-avg) becomes zero.

Martin B 2009-12-22 08:25:48

You only have underflow issues when x and the average are very close to each other, so I do not think it is an issue.

martinus 2009-12-22 11:58:56

note also that if underflow is a concern, the order of magnitude of avg can be monitored and the avg re-baselined by a fixed multiplier. x would have to be divided by the multiplier in the above code. Finally, with this size of set, it is highly likely that random sampling will produce an acceptable result (engineering is about being good enough, not perfect).

Kevin Day 2009-12-23 03:17:17

@Martin B: This method is numerically stable and recommended in Knuth, The Art of Computer Programming Vol 2, section 4.2.2. It is by the way the only sensible answer posted until now, so please upvote!!!

jug 2009-12-29 13:59:59

Answer 17

A:

Consider this:

avg(n1)         : n1                               = a1
avg(n1, n2)     : ((1/2)*n1)+((1/2)*n2)            = ((1/2)*a1)+((1/2)*n2) = a2
avg(n1, n2, n3) : ((1/3)*n1)+((1/3)*n2)+((1/3)*n3) = ((2/3)*a2)+((1/3)*n3) = a3

So for any set of doubles of arbitrary size, you could do this (this is in C#, but I'm pretty sure it could be easily translated to Java):

static double GetAverage(IEnumerable<double> values) {
    int i = 0;
    double avg = 0.0;
    foreach (double value in values) {
        avg = (((double)i / (double)(i + 1)) * avg) + ((1.0 / (double)(i + 1)) * value);
        i++;
    }

    return avg;
}

Actually, this simplifies nicely into (already provided by martinus):

static double GetAverage(IEnumerable<double> values) {
    int i = 1;
    double avg = 0.0;
    foreach (double value in values) {
        avg += (value - avg) / (i++);
    }

    return avg;
}

I wrote a quick test to try this function out against the more conventional method of summing up the values and dividing by the count (GetAverage_old). For my input I wrote this quick function to return as many random positive doubles as desired:

static IEnumerable<double> GetRandomDoubles(long numValues, double maxValue, int seed) {
    Random r = new Random(seed);
    for (long i = 0L; i < numValues; i++)
        yield return r.NextDouble() * maxValue;

    yield break;
}

And here are the results of a few test trials:

long N = 100L;
double max = double.MaxValue * 0.01;

IEnumerable<double> doubles = GetRandomDoubles(N, max, 0);
double oldWay = GetAverage_old(doubles); // 1.00535024998431E+306
double newWay = GetAverage(doubles); // 1.00535024998431E+306

doubles = GetRandomDoubles(N, max, 1);
oldWay = GetAverage_old(doubles); // 8.75142021696299E+305
newWay = GetAverage(doubles); // 8.75142021696299E+305

doubles = GetRandomDoubles(N, max, 2);
oldWay = GetAverage_old(doubles); // 8.70772312848651E+305
newWay = GetAverage(doubles); // 8.70772312848651E+305

OK, but what about for 10^9 values?

long N = 1000000000;
double max = 100.0; // we start small, to verify accuracy

IEnumerable<double> doubles = GetRandomDoubles(N, max, 0);
double oldWay = GetAverage_old(doubles); // 49.9994879713857
double newWay = GetAverage(doubles); // 49.9994879713868 -- pretty close

max = double.MaxValue * 0.001; // now let's try something enormous

doubles = GetRandomDoubles(N, max, 0);
oldWay = GetAverage_old(doubles); // Infinity
newWay = GetAverage(doubles); // 8.98837362725198E+305 -- no overflow

Naturally, how acceptable this solution is will depend on your accuracy requirements. But it's worth considering.

Dan Tao 2009-12-19 23:17:44

Answer 18

A:

Why so many complicated long answers? Here is the simplest way to find the running average so far, without any need to know the number of elements, their size, etc.

long int i = 0;
double average = 0;
while (there are still elements) {
    /* cast first: i / (i + 1) in integer arithmetic is always 0 */
    average = average * ((double) i / (i + 1)) + X[i] / (i + 1);
    i++;
}
return average;

Anil 2010-06-02 06:33:33

-1 - This is a repeat of Martinus's answer (http://stackoverflow.com/questions/1930454/what-is-a-good-solution-for-calculating-an-average-where-the-sum-of-all-values-ex/1934266#1934266).

mtrw 2010-06-02 06:43:56
