ansaurus

Question

Answer 1

+9 A:

You can get the mean and standard deviation, but not the median.

new_n = (n(0) + n(1) + ...)
new_mean = (mean(0)*n(0) + mean(1)*n(1) + ...) / new_n

new_var = ((var(0)+mean(0)**2)*n(0) + (var(1)+mean(1)**2)*n(1) + ...) / new_n - new_mean**2

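where n(0) is the number of runs in the first data set, n(1) is the number of runs in the second, and so on; mean(i) is the mean of data set i, and var(i) is its variance (which is just the standard deviation squared). x**2 means "x squared". Note that the formula assumes var is the population variance (dividing by n rather than n-1).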

Getting the combined variance relies on the fact that the variance of a data set is equal to the mean of the square of the data set minus the square of the mean of the data set. In statistical language,

Var(X) = E(X^2) - E(X)^2

The var(i)+mean(i)**2 terms above give us the E(X^2) portion for each data set, which we can combine across data sets (weighted by their sizes) and then convert back to a variance.
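The formulas above can be sketched in Python roughly as follows. This is only an illustration, not code from the answer: the function name `combine_stats` and the `(n, mean, var)` tuple layout are assumptions, and each `var` is assumed to be a population variance.

```python
from statistics import fmean, pvariance

def combine_stats(summaries):
    """Combine (n, mean, var) summaries using Var(X) = E(X^2) - E(X)^2.

    Hypothetical helper; each var is assumed to be a population
    variance (divide by n, not n-1).
    """
    new_n = sum(n for n, _, _ in summaries)
    new_mean = sum(mean * n for n, mean, _ in summaries) / new_n
    # var + mean**2 recovers E(X^2) for each data set
    mean_sq = sum((var + mean ** 2) * n for n, mean, var in summaries) / new_n
    new_var = mean_sq - new_mean ** 2
    return new_n, new_mean, new_var

# Made-up sample data, used only to check the result
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 12.0, 14.0]
summaries = [(len(x), fmean(x), pvariance(x)) for x in (a, b)]
n, mean, var = combine_stats(summaries)

# Compare with the statistics of the pooled raw data
print(n, mean, var)
print(len(a + b), fmean(a + b), pvariance(a + b))
```

The two printed lines should agree up to floating point rounding, since the combination is exact for population variances.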

In terms of medians:

If you are combining exactly two data sets, then you can be certain that the combined median lies between the two medians (or is equal to one of them), but there is little more that you can say. Taking their average should be OK, unless you need the combined median to be an actual data point.

If you are combining many data sets in one go, you can either take the median of the medians, or take their average. If there may be significant systematic differences between the data sets, then taking their average is probably better, as taking the median reduces the effect of outliers. But if you have systematic differences between runs, disregarding them is probably not a good thing to do.
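As a rough illustration of the two options, here is a sketch using made-up runs (the data is hypothetical). Neither estimate is guaranteed to equal the true combined median, which can only be computed from the raw data.

```python
from statistics import mean, median

# Hypothetical runs; normally only their medians would be available
runs = [
    [1, 2, 3, 4, 100],
    [2, 3, 4],
    [10, 11, 12, 13],
]
medians = [median(r) for r in runs]

# The two approximations discussed above
median_of_medians = median(medians)
mean_of_medians = mean(medians)

# The true value needs the pooled raw data
true_median = median([x for r in runs for x in r])
print(medians, median_of_medians, mean_of_medians, true_median)
```

The true combined median always falls between the smallest and largest of the individual medians, but the two estimates can land on either side of it.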

Artelius 2009-09-26 07:17:29

Answer 2

+3 A:

Median is not possible. Say you have two tuples, (1, 1, 1, 2) and (0, 0, 2, 3, 3). Their medians are 1 and 2, and the overall median is 1. There is no way to tell that from the two medians alone.
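A quick way to check this in Python is sketched below. The second pair, `c` and `d`, is an extra made-up example (not from the answer) with the same sizes and the same individual medians but a different combined median.

```python
from statistics import median

# The example from the answer
a = (1, 1, 1, 2)
b = (0, 0, 2, 3, 3)
print(median(a), median(b), median(a + b))

# A different pair with the same sizes and the same medians
c = (1, 1, 1, 2)
d = (2, 2, 2, 3, 3)
print(median(c), median(d), median(c + d))
```

Both pairs have medians 1 and 2, yet the combined median is 1 for the first pair and 2 for the second, so the sizes and medians do not determine it.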

bayer 2009-09-26 07:20:19

Answer 3

+10 A:

Artelius is mathematically right, but the way he suggests computing the variance is numerically unstable. You want to compute the variance as follows:

new_var = (n(0)*(var(0) + (mean(0) - new_mean)**2) + n(1)*(var(1) + (mean(1) - new_mean)**2) + ...) / new_n

Edit (from the comments):
The problem with the original code is, if your deviation is small compared to your mean, you will end up subtracting a large number from a large number to get a relatively small number, which will cause you to lose floating point precision. The new code avoids this problem; rather than convert to E(X^2) and back, it just adds all the contributions to the total variance together, properly weighted according to their sample size.
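The difference can be seen with a small Python sketch. The function names and the `(n, mean, var)` tuple layout are assumptions made for illustration, and the summaries are made up: two data sets with a very large mean and a tiny spread, for which the exact combined population variance is 1.25e-4.

```python
def combine_var_naive(summaries):
    """Combine via E(X^2) - E(X)^2 (subtracts two huge numbers)."""
    new_n = sum(n for n, _, _ in summaries)
    new_mean = sum(mean * n for n, mean, _ in summaries) / new_n
    mean_sq = sum((var + mean ** 2) * n for n, mean, var in summaries) / new_n
    return mean_sq - new_mean ** 2

def combine_var_stable(summaries):
    """Sum each set's own variance plus its squared offset from new_mean."""
    new_n = sum(n for n, _, _ in summaries)
    new_mean = sum(mean * n for n, mean, _ in summaries) / new_n
    return sum(n * (var + (mean - new_mean) ** 2)
               for n, mean, var in summaries) / new_n

# Hypothetical summaries: (n, mean, population variance)
summaries = [(1000, 1e9, 1e-4), (1000, 1e9 + 0.01, 1e-4)]
expected = 1e-4 + 0.005 ** 2  # 1.25e-4 in exact arithmetic

naive_var = combine_var_naive(summaries)
stable_var = combine_var_stable(summaries)
print("naive: ", naive_var)
print("stable:", stable_var)
```

With double precision floats, the naive version works with intermediate values around 1e18, where adjacent representable numbers are far further apart than the variance itself, so its result is dominated by rounding error. The stable version only ever squares small differences.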

comingstorm 2009-09-26 08:10:26

Good point, but could you expand on it a bit?

Artelius 2009-09-26 09:45:03

Sure. The problem with the original code is, if your deviation is small compared to your mean, you will end up subtracting a large number from a large number to get a relatively small number, which will cause you to lose floating point precision. The new code avoids this problem; rather than convert to E(X^2) and back, it just adds all the contributions to the total variance together, properly weighted according to their sample size.

comingstorm 2009-09-26 11:34:36

+1 for your answer and comment. Both are spot on, and very well written.

duffymo 2009-09-26 13:04:46


Merging two statistical result sets
