views:

540

answers:

6

I haven't really used variance calculation that much, and I don't know quite what to expect. Actually I'm not too good with math at all.

I have a an array of 1000000 random numeric values in the range 0-10000.

The array could grow even larger, so I use 64 bit int for sum.

I have tried to find code on how to calc variance, but I don't know if I get correct output.

The mean is 4692 and median is 4533. I get variance 1483780.469308 using the following code:

// size is the element count, in this case 1000000
// value_sum is __int64

double p2 = pow( (double)(value_sum - (value_sum/size)), (double)2.0 );
double variance = sqrt( (double)(p2 / (size-1)) );

Am I getting a reasonable value?

Is anything wrong with the calculation?

A: 

Since you're working with large numbers and then doing floating-point operations on them, you might want to do everything in doubles; that would save you a lot of casts.

Using pow .. 2 to calculate a square seems a bit awkward. You could calculate your number first, then multiply it by itself to get a square.

If you're doing division and feel the need to cast, cast the operands (i.e. the numerator and/or denominator) to double rather than the result. You're losing accuracy if you divide integers.

I'm not sure if your formula for variance is correct. You may want to look at the explanation in Wikipedia, for example. But I'm no math expert either, so I'm not sure you have a mistake.

Carl Smotricz
There's no compelling reason to do the summing computations in double precision arithmetic; the biggest number is about 10<sup>14</sup> with the input data, which has a factor of ten thousand of safety for the value ranges and data set size quoted.
Jonathan Leffler
Sho nuff, but there's also no compelling reason not to, especially if you manage to avoid complicating an otherwise simple program.
Carl Smotricz
+4  A: 

Note: It doesn't look like you're calculating the variance.

Variance is calculated by subtracting the mean from every element and calculating the weighted sum of these differences.

So what you need to do is:

// Get mean
double mean = static_cast<double>(value_sum)/size;

// Calculate variance
double variance = 0;
for(int i = 0;i<size;++i) 
{
  variance += (MyArray[i]-mean)*(MyArray[i]-mean)/size;
}

// Display
cout<<variance;

Note that this is the sample variance, and is used when the underlying distribution is unknown (so we assume a uniform distribution).

Also, after some digging around, I found that this is not an unbiased estimator. Wolfram Alpha has something to say about this, but as an example, when MATLAB computes the variance, it returns the "bias-corrected sample variance".

The bias-corrected variance can be obtained by dividing by each element by size-1, or:

//Please check that size > 1
variance += (MyArray[i]-mean)*(MyArray[i]-mean)/(size-1);

Also note that, the value of mean remains the same.

Jacob
Note that in this case, you could "factor out" the division by size i.e. only divide by size once, after the loop, i.e. The reason formulae have it inside is that it stands for the probability which isn't always constant like here.
mjv
@Jacob: Thanks! Using 100000 values in the range 0-10000, mean is 4690, I get variance ~8700000. Is that reasonable? What does this value tell me?
sharkin
Thanks for the expert opinion :) - but when I tried it out on MATLAB, they normalized the SSD with size-1 so I felt it was relevant to mention that here. Also, I decided to divide within the loop to keep the values as small as it has to be, but you're right - I guess it's not required.
Jacob
@Jacob: Ok, now I get it. Thanks for a great answer!
sharkin
@R.A - That value seems reasonable and indicates high variance in your data - which means that your distribution of values in your array "varies" a lot from the mean - I generated 10000 random numbers from 0-10000 (using MATLAB), and got a mean of ~5028 and variance of ~8252003
Jacob
Sure :) - don't forget, in this case variance represents the average squared "distance" between each element in your array from the mean.
Jacob
@R.A. assuming that your random data is uniformly distributed over [0, 10000] then the mean should be ~5000 = 10000/2 and the variance should be ~8333333 = 10000^2/12 as @Jacob noted. You can see the Wikipedia entry on the uniform distribution for a reference. http://en.wikipedia.org/wiki/Uniform_distribution_%28continuous%29
Mark Lavin
+1  A: 

Use a different formula maybe?

#include <functional>
#include <algorithm>
#include <iostream>
int main()
{
 using namespace std;

 vector<double> num( 3 );
 num[ 0 ] = 4000.9, num[ 1 ] = 11111.221, num[ 2 ] = -2;


 double mean = std::accumulate(num.begin(), num.end(), 0.0) / num.size();
 vector<double> diff(num.size());
 std::transform(num.begin(), num.end(), diff.begin(), 
                std::bind2nd(std::minus<double>(), mean));
 double variance = std::inner_product(diff.begin(), diff.end(), 
                                     diff.begin(), 0.0) / (num.size() - 1);
 cout << "mean = " << mean << endl
      << "variance = " << variance << endl;
}

Outputs: mean = 5036.71 variance = 3.16806e+07

dirkgently
That might be good, but I wouldn't even know where to begin if someone asked me to explain my code..
sharkin
Same idea as Jacob's, just using STL and `vector`s. Makes life easy :)
dirkgently
A: 

Since variance is the square of the standard deviation, the answers to SO 1174984 should help out. The short diagnosis is that you need to compute the sum of the squares of the values as well as the sum of the values, and you don't seem to be doing that.

Since you have 106 values, and the square of any value can be up to 108, you could end up with a sum of squares up to 1014; your 64-bit integers can store up to 1018, so you could still handle ten thousand times as many inputs, or values ranging up to one million instead of only ten thousand, without running into overflows. There's no urgent need, therefore, to move to pure double computations.

Jonathan Leffler
+1  A: 

First of all, if you're just looking to get a handle on what is a "reasonable" variance, keep in mind that variance is basically standard deviation squared. Standard deviation roughly measures the typical distance from a data point to its expected value.

So if your data has mean 4692, and your calculated variance is coming out to 1483780, that means your standard deviation is about 1218, which would suggest your numbers tend to be somewhere in the vicinity of the range 3474 - 5910. So that variance actually seems a bit low to me if the range of your numbers is 0 - 10000; but it obviously depends on the distribution of your data.

As for the calculation itself: You can calculate the variance using a running calculation as you're reading your data the first time around (you don't have to know the mean in advance) using Welford's Method:

Initialize M1 = x1 and S1 = 0.

For subsequent x's, use the recurrence formulas

Mk = Mk-1+ (xk - Mk-1)/k Sk = Sk-1 + (xk - Mk-1)*(xk - Mk).

For 2 ≤ k ≤ n, the kth estimate of the variance is s2 = Sk/(k - 1).

Dan Tao
A: 

Just for fun, a slightly different route to the same result, using std::valarray instead of std::vector and (various) algorithms:

template <class T>
T const variance(std::valarray<T> const &v) {
    if (v.size() == 0)
     return T(0.0);
    T average = v.sum() / v.size();
    std::valarray<T> diffs = v-average;
    diffs *= diffs;
    return diffs.sum()/diffs.size();
}

As Jacob hinted, there are really two possible versions of a variance calculation. As it stands, this assumes your inputs are the "universe". If you've taken only a sample of the overall universe, the last line should use: (diffs.size()-1) instead of diffs.size().

Jerry Coffin