ansaurus

Question

How to reduce calculation of average to sub-sets in a general way?

Answer 1

+4 A:

If you know the number of values beforehand (say it's N), you can add up the pre-divided terms: for the values 1, 2, 3 that is 1/N + 2/N + 3/N. You can split this into as many partial calculations as you like, and just add up the results. It may lead to a slight loss of precision, but this shouldn't be an issue unless you need a very accurate result.

If you don't know the number of items ahead of time, you might have to be more creative. But you can, again, do it progressively. Say the list is 1, 2, 3, 4. Start with mean = 1. Then mean = mean*(1/2) + 2*(1/2). Then mean = mean*(2/3) + 3*(1/3). Then mean = mean*(3/4) + 4*(1/4) etc. It's easy to generalize: for the k-th value x_k, mean = mean*((k-1)/k) + x_k*(1/k). You just have to make sure the bracketed fractions are calculated first, before multiplying, to prevent overflow.

Of course, if you want extreme accuracy (say, more than 0.001% accuracy), you may need to be a bit more careful than this, but otherwise you should be fine.
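The progressive update described above can be sketched in a few lines of Python. This is only an illustration, not code from the answer: the function name `running_mean` is made up here, and returning 0.0 for empty input is an arbitrary choice.

```python
def running_mean(values):
    """Return the mean of an iterable without ever forming the full sum."""
    mean = 0.0  # assumption: an empty input yields 0.0
    for count, value in enumerate(values, start=1):
        # The fractions are computed first, so neither product can exceed
        # the magnitude of its own operand.
        mean = mean * ((count - 1) / count) + value * (1 / count)
    return mean


small = running_mean([1, 2, 3, 4])   # close to 2.5
large = running_mean([1e308] * 4)    # stays finite, close to 1e308
```

Each step rounds, so the result can differ from the exact mean in the last few digits, which is the precision caveat mentioned above.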

Peter 2009-12-19 00:06:13

Let me augment the question, because that's not an option in this case.

Lasse V. Karlsen 2009-12-19 00:07:50

Note that your multiplication/division solution, for the case of the original question, will get into the same trouble eventually. At some point, the data type used to hold the value isn't going to have enough fidelity/range to hold the correct value.

Lasse V. Karlsen 2009-12-19 00:13:23

Let me give a reason why you might not want to do so: underflow (of course it doesn't apply for just 7 values) - see the original answer.

Davide 2009-12-19 00:14:52

Davide, that's a good point, one that I raised myself, but more of a concern. I did not object to the answers because of underflow, but because I did not understand how, given the premise that the total sum would overflow the data type, any sub-division would work correctly. But, as I said, it's Sunday morning, 1 o'clock, and all that ;)

Lasse V. Karlsen 2009-12-19 00:26:17

@Lasse: In what time zone is it Sunday morning when it's Saturday morning UTC?

P Daddy 2009-12-19 02:44:55

Answer 2

A:

When you split the numbers into sets you're just dividing by the total number or am I missing something?

You have written it as

/ 1   2   3 \   / 4   5   6 \
| - + - + - | + | - + - + - |
\ 3   3   3 /   \ 3   3   3 /
 ----------      -----------
      2               2

but that's just

/ 1   2   3 \   / 4   5   6 \
| - + - + - | + | - + - + - |
\ 6   6   6 /   \ 6   6   6 /

so for the numbers from 1 to 7 one possible grouping is just

/ 1   2   3 \   / 4   5   6 \   / 7 \
| - + - + - | + | - + - + - | + | - |
\ 7   7   7 /   \ 7   7   7 /   \ 7 /
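As a small sketch of that grouping in Python (the function name `mean_by_groups` is hypothetical), each value is divided by the total count before it is added, so the group results can simply be summed:

```python
def mean_by_groups(groups):
    """Average all values in `groups`, dividing each value by the total count first."""
    total_count = sum(len(group) for group in groups)
    # Each group contributes sum(x / N); the per-group results just add up.
    partials = [sum(x / total_count for x in group) for group in groups]
    return sum(partials)


result = mean_by_groups([[1, 2, 3], [4, 5, 6], [7]])  # approximately 4.0
```

The grouping does not have to be even; any split gives the same result up to rounding.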

Troubadour 2009-12-19 00:06:15

Answer 3

+6 A:

Well, suppose you added three numbers and divided by three, and then added two numbers and divided by two. Can you get the average from these?

x = (a + b + c) / 3
y = (d + e) / 2
z = (f + g) / 2

And you want

r = (a + b + c + d + e + f + g) / 7

That is equal to

r = (3 * (a + b + c) / 3 + 2 * (d + e) / 2 + 2 * (f + g) / 2) / 7
r = (3 * x + 2 * y + 2 * z) / 7

Both lines above overflow, of course, but since division is distributive, we do

r = (3.0 / 7.0) * x + (2.0 / 7.0) * y + (2.0 / 7.0) * z

Which guarantees that you won't overflow, as I'm multiplying x, y and z by fractions less than one.

This is the fundamental point here. I am neither dividing all numbers by the total count beforehand, nor ever exceeding the range of the data type.

So... if you keep adding to an accumulator, keep track of how many numbers you have added, and always test whether the next number will cause an overflow, you can then get partial averages and compute the final average from them.

And no, if you don't know the values beforehand, it doesn't change anything (provided that you can count them as you sum them).
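To make the weighting concrete, here is a minimal Python sketch. The helper name `combine_averages` and the sample values are assumptions for illustration only. It combines (average, count) pairs with weights count/total, and compares that against a naive sum that overflows:

```python
def combine_averages(partials):
    """Combine (average, count) pairs into the overall average."""
    total = sum(count for _, count in partials)
    # Every weight count/total is at most 1, so no term exceeds its own average.
    return sum(avg * (count / total) for avg, count in partials)


# The example from the text: x, y, z are averages of 3, 2 and 2 numbers.
r = combine_averages([(2.0, 3), (4.5, 2), (6.5, 2)])   # mean of 1..7, about 4.0

big = 1e308                                  # close to the largest finite double
naive = (big + big + big + big) / 4          # the sum overflows to inf
chunked = combine_averages([(big, 1)] * 4)   # stays finite, about 1e308
```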

Here is a Scala function that does it. It's not idiomatic Scala, so that it can be more easily understood:

def avg(input: List[Double]): Double = {
  // Finished chunks, stored as (chunk average, chunk size)
  var partialAverages: List[(Double, Int)] = Nil
  var inputLength = 0
  var currentSum = 0.0
  var currentCount = 0
  var numbers = input

  while (numbers.nonEmpty) {
    val number = numbers.head
    val rest = numbers.tail
    // Would adding this number take the running sum outside the range of Double?
    val overflowsUp = number > 0 && currentSum > 0 && Double.MaxValue - currentSum < number
    val overflowsDown = number < 0 && currentSum < 0 && Double.MinValue - currentSum > number
    if (overflowsUp || overflowsDown) {
      // Close the current chunk and start a new one
      partialAverages = (currentSum / currentCount, currentCount) :: partialAverages
      currentSum = 0.0
      currentCount = 0
    }
    currentSum += number
    currentCount += 1
    inputLength += 1
    numbers = rest
  }
  // The last chunk (for an empty input this is 0.0 / 0, so the result is NaN)
  partialAverages = (currentSum / currentCount, currentCount) :: partialAverages

  // Weight each chunk average by its share of the input and add them up
  var result = 0.0
  while (partialAverages.nonEmpty) {
    val (partialAverage, partialCount) = partialAverages.head
    result += partialAverage * (partialCount.toDouble / inputLength)
    partialAverages = partialAverages.tail
  }

  result
}

EDIT: Won't multiplying with 2, and 3, get me back into the range of "not supported by the data type?"

No. If you were dividing by 7 at the end, absolutely. But here you are dividing at each step of the sum. Even in your real case the weights (like 2/7 and 3/7 here) would be in the range of manageable numbers (e.g. 1/10 ~ 1/10000), which wouldn't make a big difference compared to your weight (i.e. 1).

PS: I wonder why I'm working on this answer instead of writing mine where I can earn my rep :-)

Daniel 2009-12-19 00:08:57

This would have been more or less my answer that I am typing. Kudos to you for being faster (I'm discarding mine). Kudos also for the better notation :-)

Davide 2009-12-19 00:12:22

But won't that just get me back to my original problem? The original problem, as stated by the original question, is that the "Double" data type cannot hold the total sum to be averaged. When I multiply, in your code, won't that collide with the same limitations?

Lasse V. Karlsen 2009-12-19 00:14:36

@Lasse, there was a line which was overflowing, but it was unnecessary. I removed it from the answer.

Davide 2009-12-19 00:21:15

... hmm, no it won't. I think I'm getting it now.

Lasse V. Karlsen 2009-12-19 00:22:02

Or perhaps not, won't multiplying with 2, and 3, get me back into the range of "not supported by the data type?"

Lasse V. Karlsen 2009-12-19 00:31:57

You are not multiplying by 2 and 3, you are multiplying by 2/7 and 3/7, which are less than 1. So it won't overflow.

Daniel 2009-12-19 01:27:11

Answer 4

+1 A:

Thinking outside the box: Use the median instead. It's much easier to calculate - there are tons of algorithms out there (e.g. using queues), you can often construct good arguments as to why it's more meaningful for data sets (less swayed by extreme values; etc) and you will have zero problems with numerical accuracy. It will be fast and efficient. Plus, for large data sets (which it sounds like you have), unless the distributions are truly weird, the values for the mean and median will be similar.
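A quick illustration in Python, using the standard library's `statistics` module (the sample data is made up). Note that `statistics.median` works on a sorted copy of the data, and for an even number of values it averages the two middle ones, so this sketch sticks to odd-length inputs:

```python
import statistics

# Values near the top of the double range: no summation is needed for an
# odd-length input, so nothing can overflow.
huge = [1e308, 1e308, 1e308, 3.0, 1e308]
median_huge = statistics.median(huge)        # 1e308

# The median is barely moved by an extreme value, unlike the mean.
skewed = [1, 2, 3, 4, 1000]
median_skewed = statistics.median(skewed)    # 3
mean_skewed = statistics.mean(skewed)        # 202
```

The second data set also shows the catch: mean and median are different statistics, and they only agree closely when the distribution is reasonably symmetric.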

Peter 2009-12-19 00:21:05

+1 for thinking outside the box, but not +2 (oh how I wish I could award some answers +2 instead of just +1) because you're sidestepping my whole question, and since my question is on the premises that "this is what I want to do, damned if there are other ways to do it", then unfortunately it's not the right answer.

Lasse V. Karlsen 2009-12-19 00:24:47

Answer 5

A:

Average of x_1 .. x_N
    = (Sum(i=1,N,x_i)) / N
    = (Sum(i=1,M,x_i) + Sum(i=M+1,N,x_i)) / N
    = (Sum(i=1,M,x_i)) / N + (Sum(i=M+1,N,x_i)) / N

This can be repeatedly applied, and is true regardless of whether the summations are of equal size. So:

Keep adding terms until both:
- adding another one will overflow (or otherwise lose precision)
- dividing by N will not underflow
Divide the sum by N
Add the result to the average-so-far
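The steps above can be sketched in Python roughly as follows. This is a simplified illustration under stated assumptions: the function name `chunked_mean` is made up, N is known in advance (the input is a list), and only the overflow condition is checked, not the "dividing by N will not underflow" condition.

```python
import sys


def chunked_mean(values):
    """Mean of a non-empty list via partial sums, each divided by N before it can overflow."""
    n = len(values)
    limit = sys.float_info.max   # largest finite double
    average = 0.0                # the average-so-far
    partial = 0.0                # the current partial sum
    for x in values:
        # Flush the partial sum if adding x would push it past the finite range.
        if (x > 0 and partial > limit - x) or (x < 0 and partial < -limit - x):
            average += partial / n
            partial = 0.0
        partial += x
    return average + partial / n


ordinary = chunked_mean([1, 2, 3, 4, 5, 6, 7])   # a single chunk: 28 / 7
extreme = chunked_mean([1e308] * 4)              # four chunks, result stays finite
```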

There's one obvious awkward case, which is that there are some very small terms at the end of the sequence, such that you run out of values before you satisfy the condition "dividing by N will not underflow". In which case just discard those values - if their contribution to the average cannot be represented in your floating type, then it is in particular smaller than the precision of your average. So it doesn't make any difference to the result whether you include those terms or not.

There are also some less obvious awkward cases to do with loss of precision on individual summations. For example, what's the average of the values:

10^100, 1, -10^100

Mathematics says the sum is 1 (so the average is 1/3), but floating-point arithmetic says it depends on the order in which you add up the terms, and in 4 of the 6 possible orders the sum is 0, because (10^100) + 1 = 10^100. But I think that the non-associativity of floating-point addition is a different and more general problem than this question. If sorting the input is out of the question, I think there are things you can do where you maintain lots of accumulators of different magnitudes, and add each new value to whichever one of them will give the best precision. But I don't really know.
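The order dependence is easy to reproduce in Python. As one existing tool in the spirit of "lots of accumulators", the standard library's `math.fsum` avoids this loss of precision by tracking multiple intermediate partial sums. The snippet below is only a demonstration on the three values above, not a general solution to the overflow question:

```python
import math

values = [1e100, 1.0, -1e100]

# Left-to-right addition: the 1.0 is absorbed by 1e100 and lost.
naive_sum = (values[0] + values[1]) + values[2]   # 0.0

# fsum keeps exact partial sums, so the small term survives.
exact_sum = math.fsum(values)                     # 1.0

naive_mean = naive_sum / len(values)
exact_mean = exact_sum / len(values)
```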

Steve Jessop 2009-12-19 00:43:55

Answer 6

A:

Some of the mathematical solutions here are very good. Here's a simple technical solution.

Use a larger data type. This breaks down into two possibilities:

Use a high-precision floating point library. One who encounters a need to average a billion numbers probably has the resources to purchase, or the brain power to write, a 128-bit (or longer) floating point library.

I understand the drawbacks here. It would certainly be slower than using intrinsic types. You still might over/underflow if the number of values grows too high. Yada yada.

If your values are integers or can be easily scaled to integers, keep your sum in a list of integers. When you overflow, simply add another integer. This is essentially a simplified implementation of the first option. A simple ~~(untested)~~ example in C# follows:

class BigMeanSet{
    List<uint> list = new List<uint>();

    public double GetAverage(IEnumerable<uint> values){
        list.Clear();
        list.Add(0);

        uint count = 0;

        foreach(uint value in values){
            Add(0, value);
            count++;
        }

        return DivideBy(count);
    }

    void Add(int listIndex, uint value){
        if((list[listIndex] += value) < value){ // then overflow has occurred
            if(list.Count == listIndex + 1)
                list.Add(0);
            Add(listIndex + 1, 1);
        }
    }

    double DivideBy(uint count){
        const double shift = 4.0 * 1024 * 1024 * 1024;

        double rtn       = 0;
        long   remainder = 0;

        for(int i = list.Count - 1; i >= 0; i--){
            rtn *= shift;
            remainder <<= 32;
            rtn += Math.DivRem(remainder + list[i], count, out remainder);
        }

        rtn += remainder / (double)count;

        return rtn;
    }
}

Like I said, this is untested—I don't have a billion values I really want to average—so I've probably made a mistake or two, especially in the DivideBy function, but it should demonstrate the general idea.

This should provide as much accuracy as a double can represent and should work for any number of 32-bit elements, up to 2³² - 1. If more elements are needed, then the count variable will need to be expanded and the DivideBy function will increase in complexity, but I'll leave that as an exercise for the reader.

In terms of efficiency, it should be as fast or faster than any other technique here, as it only requires iterating through the list once, only performs one division operation (well, one set of them), and does most of its work with integers. I didn't optimize it, though, and I'm pretty certain it could be made slightly faster still if necessary. Ditching the recursive function call and list indexing would be a good start. Again, an exercise for the reader. The code is intended to be easy to understand.
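For comparison, the same "use a larger data type" idea is available out of the box in languages with arbitrary-precision integers. A minimal Python sketch follows; the function name `exact_integer_mean` is hypothetical, and this is not a port of the C# class above.

```python
from fractions import Fraction


def exact_integer_mean(values):
    """Mean of a non-empty iterable of ints, computed with an exact integer sum."""
    total = 0   # Python ints grow as needed, so the sum cannot overflow
    count = 0
    for value in values:
        total += value
        count += 1
    # An exact rational; converting with float() rounds only once, at the end.
    return Fraction(total, count)


mean = exact_integer_mean([2**32 - 1] * 1000)   # a sum far beyond 32 bits
as_float = float(mean)
```

An empty input raises `ZeroDivisionError` here, and the big-integer arithmetic is slower than fixed-width sums, which mirrors the speed drawback noted for the first option.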

~~If anybody more motivated than I am at the moment feels like verifying the correctness of the code, and fixing whatever problems there might be, please be my guest.~~

I've now tested this code, and made a couple of small corrections (a missing pair of parentheses in the List<uint> constructor call, and an incorrect divisor in the final division of the DivideBy function).

I tested it by first running it through 1000 sets of random length (ranging between 1 and 1000) filled with random integers (ranging between 0 and 2³² - 1). These were sets for which I could easily and quickly verify accuracy by also running a canonical mean on them.

I then tested with 100* large series, with random length between 10⁵ and 10⁹. The lower and upper bounds of these series were also chosen at random, constrained so that the series would fit within the range of a 32-bit integer. For any series, the results are easily verifiable as (lowerbound + upperbound) / 2.

* Okay, that's a little white lie. I aborted the large-series test after about 20 or 30 successful runs. A series of length 10⁹ takes just under a minute and a half to run on my machine, so half an hour or so of testing this routine was enough for my tastes.

For those interested, my test code is below:

static IEnumerable<uint> GetSeries(uint lowerbound, uint upperbound){
    for(uint i = lowerbound; i <= upperbound; i++)
        yield return i;
}

static void Test(){
    Console.BufferHeight = 1200;
    Random rnd = new Random();

    for(int i = 0; i < 1000; i++){
        uint[] numbers = new uint[rnd.Next(1, 1000)];
        for(int j = 0; j < numbers.Length; j++)
            numbers[j] = (uint)rnd.Next();

        double sum = 0;
        foreach(uint n in numbers)
            sum += n;

        double avg = sum / numbers.Length;
        double ans = new BigMeanSet().GetAverage(numbers);

        Console.WriteLine("{0}: {1} - {2} = {3}", numbers.Length, avg, ans, avg - ans);

        if(avg != ans)
            Debugger.Break();
    }

    for(int i = 0; i < 100; i++){
        uint length     = (uint)rnd.Next(100000, 1000000001);
        uint lowerbound = (uint)rnd.Next(int.MaxValue - (int)length);
        uint upperbound = lowerbound + length;

        double avg = ((double)lowerbound + upperbound) / 2;
        double ans = new BigMeanSet().GetAverage(GetSeries(lowerbound, upperbound));

        Console.WriteLine("{0}: {1} - {2} = {3}", length, avg, ans, avg - ans);

        if(avg != ans)
            Debugger.Break();
    }
}

P Daddy 2009-12-19 02:41:08

Answer 7

A:

Alok 2009-12-19 17:01:52

Answer 8

+1 A:

Let X be your sample set. Partition it into two sets A and B in any way that you like. Define delta = m_B - m_A where m_S denotes the mean of a set S. Then

m_X = m_A + delta * |B| / |X|

where |S| denotes the cardinality of a set S. Now you can repeatedly apply this to partition and calculate the mean.

Why is this true? Let s = 1 / |A| and t = 1 / |B| and u = 1 / |X| (for convenience of notation) and let aSigma and bSigma denote the sum of the elements in A and B respectively so that:

  m_A + delta * |B| / |X|
= s * aSigma + u * |B| * (t * bSigma - s * aSigma)
= s * aSigma + u * (bSigma - |B| * s * aSigma)
= s * aSigma + u * bSigma - u * |B| * s * aSigma
= s * aSigma * (1 - u * |B|) + u * bSigma
= s * aSigma * (u * |X| - u * |B|) + u * bSigma
= s * u * aSigma * (|X| - |B|) + u * bSigma
= s * u * aSigma * |A| + u * bSigma
= u * aSigma + u * bSigma
= u * (aSigma + bSigma)
= u * (xSigma)
= xSigma / |X|
= m_X

The proof is complete.

From here it is obvious how to use this to either recursively compute a mean (say by repeatedly splitting a set in half) or how to use this to parallelize the computation of the mean of a set.

The well-known on-line algorithm for calculating the mean is just a special case of this. This is the algorithm which says that if m is the mean of {x_1, x_2, ..., x_n}, then the mean of {x_1, x_2, ..., x_n, x_(n+1)} is m + (x_(n+1) - m) / (n + 1). So with X = {x_1, x_2, ..., x_(n+1)}, A = {x_1, x_2, ..., x_n}, and B = {x_(n+1)} we recover the on-line algorithm.
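A short Python sketch of the merge formula and of the recursive halving mentioned above. The function names are made up for illustration. Note also that delta = m_B - m_A can itself overflow if the two means are huge and of opposite sign, which this sketch does not guard against.

```python
def merge_means(mean_a, size_a, mean_b, size_b):
    """Mean of the union of A and B: m_A + (m_B - m_A) * |B| / |X|."""
    delta = mean_b - mean_a
    return mean_a + delta * (size_b / (size_a + size_b))


def recursive_mean(values):
    """Mean of a non-empty list, computed by repeatedly splitting it in half."""
    if len(values) == 1:
        return float(values[0])
    mid = len(values) // 2
    left, right = values[:mid], values[mid:]
    return merge_means(recursive_mean(left), len(left),
                       recursive_mean(right), len(right))


halved = recursive_mean([1, 2, 3, 4, 5, 6, 7])   # about 4.0
# Appending a single value x to a set of n values with mean m is the
# special case merge_means(m, n, x, 1) = m + (x - m) / (n + 1).
online = merge_means(2.5, 4, 10.0, 1)            # mean of 1, 2, 3, 4, 10
```

Because the two halves are independent, the recursive calls could also be run in parallel and merged afterwards.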

Jason 2009-12-19 17:10:27
