ansaurus

Question

Answer 1

A:

Since you're working with large numbers and then doing floating-point operations on them, you might want to do everything in doubles; that would save you a lot of casts.

Using pow .. 2 to calculate a square seems a bit awkward. You could calculate your number first, then multiply it by itself to get a square.

If you're doing division and feel the need to cast, cast the operands (i.e. the numerator and/or denominator) to double rather than the result. You're losing accuracy if you divide integers.

I'm not sure if your formula for variance is correct. You may want to look at the explanation in Wikipedia, for example. But I'm no math expert either, so I'm not sure you have a mistake.

Carl Smotricz 2009-11-12 12:58:12

There's no compelling reason to do the summing computations in double precision arithmetic; the biggest number is about 10<sup>14</sup> with the input data, which has a factor of ten thousand of safety for the value ranges and data set size quoted.

Jonathan Leffler 2009-11-12 13:12:05

Sho nuff, but there's also no compelling reason not to, especially if you manage to avoid complicating an otherwise simple program.

Carl Smotricz 2009-11-12 14:32:34

Answer 2

+4 A:

Note: It doesn't look like you're calculating the variance.

Variance is calculated by subtracting the mean from every element and calculating the weighted sum of these differences.

So what you need to do is:

// Get mean
double mean = static_cast<double>(value_sum)/size;

// Calculate variance
double variance = 0;
for(int i = 0;i<size;++i) 
{
  variance += (MyArray[i]-mean)*(MyArray[i]-mean)/size;
}

// Display
cout<<variance;

Note that this is the sample variance, and is used when the underlying distribution is unknown (so we assume a uniform distribution).

Also, after some digging around, I found that this is not an unbiased estimator. Wolfram Alpha has something to say about this, but as an example, when MATLAB computes the variance, it returns the "bias-corrected sample variance".

The bias-corrected variance can be obtained by dividing by each element by size-1, or:

//Please check that size > 1
variance += (MyArray[i]-mean)*(MyArray[i]-mean)/(size-1);

Also note that, the value of mean remains the same.

Jacob 2009-11-12 12:58:55

Note that in this case, you could "factor out" the division by size i.e. only divide by size once, after the loop, i.e. The reason formulae have it inside is that it stands for the probability which isn't always constant like here.

mjv 2009-11-12 13:28:50

@Jacob: Thanks! Using 100000 values in the range 0-10000, mean is 4690, I get variance ~8700000. Is that reasonable? What does this value tell me?

sharkin 2009-11-12 13:31:57

Thanks for the expert opinion :) - but when I tried it out on MATLAB, they normalized the SSD with size-1 so I felt it was relevant to mention that here. Also, I decided to divide within the loop to keep the values as small as it has to be, but you're right - I guess it's not required.

Jacob 2009-11-12 13:32:52

@Jacob: Ok, now I get it. Thanks for a great answer!

sharkin 2009-11-12 13:41:08

@R.A - That value seems reasonable and indicates high variance in your data - which means that your distribution of values in your array "varies" a lot from the mean - I generated 10000 random numbers from 0-10000 (using MATLAB), and got a mean of ~5028 and variance of ~8252003

Jacob 2009-11-12 13:41:45

Sure :) - don't forget, in this case variance represents the average squared "distance" between each element in your array from the mean.

Jacob 2009-11-12 13:43:54

@R.A. assuming that your random data is uniformly distributed over [0, 10000] then the mean should be ~5000 = 10000/2 and the variance should be ~8333333 = 10000^2/12 as @Jacob noted. You can see the Wikipedia entry on the uniform distribution for a reference. http://en.wikipedia.org/wiki/Uniform_distribution_%28continuous%29

Mark Lavin 2009-11-12 13:48:04

Answer 3

+1 A:

Use a different formula maybe?

#include <functional>
#include <algorithm>
#include <iostream>
int main()
{
 using namespace std;

 vector<double> num( 3 );
 num[ 0 ] = 4000.9, num[ 1 ] = 11111.221, num[ 2 ] = -2;


 double mean = std::accumulate(num.begin(), num.end(), 0.0) / num.size();
 vector<double> diff(num.size());
 std::transform(num.begin(), num.end(), diff.begin(), 
                std::bind2nd(std::minus<double>(), mean));
 double variance = std::inner_product(diff.begin(), diff.end(), 
                                     diff.begin(), 0.0) / (num.size() - 1);
 cout << "mean = " << mean << endl
      << "variance = " << variance << endl;
}

Outputs: mean = 5036.71 variance = 3.16806e+07

dirkgently 2009-11-12 13:01:15

That might be good, but I wouldn't even know where to begin if someone asked me to explain my code..

sharkin 2009-11-12 13:04:50

Same idea as Jacob's, just using STL and `vector`s. Makes life easy :)

dirkgently 2009-11-12 13:06:53

Answer 4

A:

Since variance is the square of the standard deviation, the answers to SO 1174984 should help out. The short diagnosis is that you need to compute the sum of the squares of the values as well as the sum of the values, and you don't seem to be doing that.

Since you have 10⁶ values, and the square of any value can be up to 10⁸, you could end up with a sum of squares up to 10¹⁴; your 64-bit integers can store up to 10¹⁸, so you could still handle ten thousand times as many inputs, or values ranging up to one million instead of only ten thousand, without running into overflows. There's no urgent need, therefore, to move to pure double computations.

Jonathan Leffler 2009-11-12 13:01:36

Answer 5

+1 A:

First of all, if you're just looking to get a handle on what is a "reasonable" variance, keep in mind that variance is basically standard deviation squared. Standard deviation roughly measures the typical distance from a data point to its expected value.

So if your data has mean 4692, and your calculated variance is coming out to 1483780, that means your standard deviation is about 1218, which would suggest your numbers tend to be somewhere in the vicinity of the range 3474 - 5910. So that variance actually seems a bit low to me if the range of your numbers is 0 - 10000; but it obviously depends on the distribution of your data.

As for the calculation itself: You can calculate the variance using a running calculation as you're reading your data the first time around (you don't have to know the mean in advance) using Welford's Method:

Initialize M1 = x1 and S1 = 0.

For subsequent x's, use the recurrence formulas

Mk = Mk-1+ (xk - Mk-1)/k Sk = Sk-1 + (xk - Mk-1)*(xk - Mk).

For 2 ≤ k ≤ n, the kth estimate of the variance is s2 = Sk/(k - 1).

Dan Tao 2009-11-12 13:18:14

Answer 6

A:

Just for fun, a slightly different route to the same result, using std::valarray instead of std::vector and (various) algorithms:

template <class T>
T const variance(std::valarray<T> const &v) {
    if (v.size() == 0)
     return T(0.0);
    T average = v.sum() / v.size();
    std::valarray<T> diffs = v-average;
    diffs *= diffs;
    return diffs.sum()/diffs.size();
}

As Jacob hinted, there are really two possible versions of a variance calculation. As it stands, this assumes your inputs are the "universe". If you've taken only a sample of the overall universe, the last line should use: (diffs.size()-1) instead of diffs.size().

Jerry Coffin 2009-11-12 15:32:06

ansaurus

tags:

views:

answers:

Calculating variance with large numbers

related questions