views: 5849
answers: 7
I have a value type that represents a Gaussian distribution:

struct Gauss {
    public double mean;
    public double variance;
    public Gauss(double mean, double variance) {
        this.mean = mean; this.variance = variance;
    }
}

I would like to perform an integral over a series of these values:

Gauss eulerIntegrate(double dt, Gauss iv, Gauss[] values) {
    Gauss r = iv;
    foreach (Gauss v in values) {
        r += v*dt;
    }
    return r;
}

My question is how to implement addition for these normal distributions.

The multiplication by a scalar (dt) seemed simple enough. But it wasn't simple! Thanks FOOSHNICK for the help:

public static Gauss operator * (Gauss g, double d) {
    return new Gauss(g.mean * d, g.variance * d * d);
}
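As a sanity check (a Python sketch, not part of the original thread), a small Monte Carlo experiment confirms that scaling a normal variable by d scales its variance by d^2:

```python
import random
import statistics

random.seed(0)
d = 0.1  # the scalar, playing the role of dt
# X ~ N(mean 3, variance 4); note random.gauss takes a standard deviation
xs = [random.gauss(3.0, 2.0) for _ in range(200_000)]
ys = [d * x for x in xs]

mean_y = statistics.mean(ys)     # should land near 3 * d = 0.3
var_y = statistics.variance(ys)  # should land near 4 * d^2 = 0.04
```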

However, addition eludes me. I assume I can just add the means; it's the variance that's causing me trouble. Either of these definitions seems "logical" to me.

public static Gauss operator + (Gauss a, Gauss b) {
    double mean = a.mean + b.mean;
    // Is it this? (Yes, it is!)
    return new Gauss(mean, a.variance + b.variance);        
    // Or this? (nope)
    //return new Gauss(mean, Math.Max(a.variance, b.variance));
    // Or how about this? (nope)
    //return new Gauss(mean, (a.variance + b.variance)/2);
}

Can anyone help define a statistically correct - or at least "reasonable" - version of the + operator?

I suppose I could switch the code to use interval arithmetic instead, but I was hoping to stay in the world of prob and stats.

+5  A: 

The sum of two independent normally distributed random variables is itself normally distributed:

N(mean1, variance1) + N(mean2, variance2) ~ N(mean1 + mean2, variance1 + variance2)

This is all on the Wikipedia page about sums of normally distributed random variables.

Be careful that these really are variances and not standard deviations.

// X + Y
public static Gauss operator + (Gauss a, Gauss b) {
    //NOTE: this is valid if X,Y are independent normal random variables
    return new Gauss(a.mean + b.mean, a.variance + b.variance);
}

// X*b
public static Gauss operator * (Gauss a, double b) {
    return new Gauss(a.mean*b, a.variance*b*b);
}
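A quick Monte Carlo check of this rule (a Python sketch, not part of the answer): drawing the two variables independently, the sample mean and variance of the sum should match mean1 + mean2 and variance1 + variance2.

```python
import random
import statistics

random.seed(1)
n = 200_000
# X ~ N(1, variance 4), Y ~ N(2, variance 9), drawn independently;
# random.gauss takes a standard deviation, not a variance
xs = [random.gauss(1.0, 2.0) for _ in range(n)]
ys = [random.gauss(2.0, 3.0) for _ in range(n)]
zs = [x + y for x, y in zip(xs, ys)]

mean_z = statistics.mean(zs)    # should land near 1 + 2 = 3
var_z = statistics.variance(zs) # should land near 4 + 9 = 13
```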
David Norman
Fantastic, I love that it's exact and not an estimation.
Frank Krueger
I've added code examples.
J.F. Sebastian
+1  A: 

Hah, I thought you couldn't add gaussian distributions together, but you can!

http://mathworld.wolfram.com/NormalSumDistribution.html

In fact, the mean of the sum is the sum of the individual means, and the variance of the sum is the sum of the individual variances.

MSN
+2  A: 

Well, your multiplication by scalar is wrong - you should multiply variance by the square of d. If you're adding a constant, then just add it to the mean, the variance stays the same. If you're adding two distributions, then add the means and add the variances.

Other way around: when you multiply a variable by d, you multiply the variance by d squared.
John D. Cook
+1  A: 

I'm not sure that I like what you're calling "integration" over a series of values. Do you mean that word in a calculus sense? Are you trying to do numerical integration? There are other, better ways to do that. Yours doesn't look right to me, let alone optimal.

The Gaussian distribution is a nice, smooth function. I think a nice quadrature approach or Runge-Kutta would be a much better idea.
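To make the point concrete with a deterministic integrand (a Python sketch, not duffymo's code): the left-endpoint accumulation that eulerIntegrate uses converges much more slowly than even simple trapezoidal quadrature.

```python
# Integrate f(t) = t^2 over [0, 1]; the exact answer is 1/3.
def f(t):
    return t * t

n, dt = 100, 0.01
# Left-endpoint (Euler-style) accumulation vs. the trapezoidal rule
euler = sum(f(i * dt) * dt for i in range(n))
trap = sum((f(i * dt) + f((i + 1) * dt)) / 2 * dt for i in range(n))
# euler is off by about 5e-3; trap is off by about 2e-5
```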

duffymo
Sure, I would go with Runge-Kutta, but I didn't feel like typing that into a Stack Overflow editor :-)
Frank Krueger
Agreed, but my clairvoyant interface is down. I have no way to know if you type one thing and think another. Don't mislead - a pseudo-code comment like "// 5th order R-K goes here" would tell me much more.
duffymo
Either way, my data is very boring, and the Euler integration above is plenty sufficient. There's nothing wrong with it. I even named it "euler" to try to avoid these kinds of comments. ;-)
Frank Krueger
+2  A: 

Can anyone help define a statistically correct - or at least "reasonable" - version of the + operator?

Arguably not, since adding two distributions can mean different things. Having worked in reliability and maintainability, my first reaction from the title would be the distribution of a system's MTBF, if the MTBF of each part is normally distributed and the system has no redundancy. You are talking about the distribution of the sum of two normally distributed independent variates, not the (logical) sum of the effect of two normal distributions. Very often, operator overloading has surprising semantics. I'd leave it as a function and call it 'normalSumDistribution' unless your code has a very specific target audience.

Pete Kirkham
+2  A: 

To be more precise:

If a random variable Z is defined as the linear combination of two uncorrelated Gaussian random variables X and Y, then Z is itself a Gaussian random variable, e.g.:

if Z = aX + bY, then mean(Z) = a * mean(X) + b * mean(Y), and variance(Z) = a^2 * variance(X) + b^2 * variance(Y).

If the random variables are correlated, then you have to account for that. Variance(X) is defined by the expected value E([X - mean(X)]^2). Working this through for Z = aX + bY, we get:

variance(Z) = a^2 * variance(X) + b^2 * variance(Y) + 2ab * covariance(X,Y)
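A hypothetical Python Monte Carlo check of the covariance term, constructing correlated X and Y by mixing in shared noise so that covariance(X, Y) = 0.5:

```python
import random
import statistics

random.seed(2)
n = 200_000
a, b = 2.0, 3.0
xs = [random.gauss(0.0, 1.0) for _ in range(n)]  # X ~ N(0, 1)
ws = [random.gauss(0.0, 1.0) for _ in range(n)]  # independent noise W
ys = [0.5 * x + w for x, w in zip(xs, ws)]       # Y = 0.5X + W
# so variance(Y) = 0.25 + 1 = 1.25 and covariance(X, Y) = 0.5
zs = [a * x + b * y for x, y in zip(xs, ys)]

var_z = statistics.variance(zs)
# predicted: a^2 * 1 + b^2 * 1.25 + 2ab * 0.5 = 4 + 11.25 + 6 = 21.25
```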

If you are summing two independent random variables which do not have Gaussian distributions, then the distribution of the sum is the convolution of the two component distributions.
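For discrete distributions the convolution can be computed directly; a small Python illustration (not from the answer) using the sum of two fair dice:

```python
# Distribution of the sum of two fair dice, via discrete convolution.
die = {k: 1 / 6 for k in range(1, 7)}

def convolve(p, q):
    """Distribution of the sum of two independent discrete variables."""
    out = {}
    for i, pi in p.items():
        for j, qj in q.items():
            out[i + j] = out.get(i + j, 0.0) + pi * qj
    return out

two_dice = convolve(die, die)
# two_dice[7] == 6/36, the most likely total
```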

If you are summing two correlated non-Gaussian random variables, you have to work through the appropriate integrals yourself.

Jason S
+2  A: 

I would have thought it depends on what type of addition you are doing. If you just want to get a normal distribution with properties (mean, standard deviation etc.) equal to the sum of two distributions then the addition of the properties as given in the other answers is fine. This is the assumption used in something like PERT where if a large number of normal probability distributions are added up then the resulting probability distribution is another normal probability distribution.

The problem comes when the two distributions being added are not similar. Take, for instance, adding a probability distribution with a mean of 2 and a standard deviation of 1 to a probability distribution with a mean of 10 and a standard deviation of 2. If you add these two distributions up, you get a probability distribution with two peaks, one at 2-ish and one at 10-ish. The result is therefore not a normal distribution. The assumption about adding distributions is only really valid if the original distributions are either very similar or you have a lot of original distributions so that the peaks and troughs can be evened out.
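The distinction can be made concrete (a Python sketch using this answer's numbers): an equal-weight mixture of N(2, 1) and N(10, 4) is the bimodal object described here, and its variance is inflated by how far apart the two means sit, whereas the sum of the two variables would simply be N(12, 5).

```python
# Equal-weight mixture of N(2, variance 1) and N(10, variance 4).
# This models "pick one of the two distributions at random", which is
# what produces the two peaks -- not the distribution of X + Y.
w1, m1, v1 = 0.5, 2.0, 1.0
w2, m2, v2 = 0.5, 10.0, 4.0

mix_mean = w1 * m1 + w2 * m2  # 6.0
# E[X^2] under the mixture, minus the squared mean:
mix_var = w1 * (v1 + m1 ** 2) + w2 * (v2 + m2 ** 2) - mix_mean ** 2
# = 0.5*(1 + 4) + 0.5*(4 + 100) - 36 = 18.5, far wider than either part
```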

Ian Turner