I have two sets of statistics generated from processing. The processing can produce a large number of results, so I would rather not store all of the raw data just to recalculate the combined statistics later on.

Say I have two sets of statistics that describe two different sessions of runs over a process.

Each set contains:

Statistics: { mean, median, standard deviation, number of runs on the process }

How would I merge the two sets' medians and standard deviations to get a combined summary describing both sets of statistics?

Remember, I can't preserve both sets of data that the statistics are describing.

+9  A: 

You can get the mean and standard deviation, but not the median.

new_n = (n(0) + n(1) + ...)
new_mean = (mean(0)*n(0) + mean(1)*n(1) + ...) / new_n

new_var = ((var(0)+mean(0)**2)*n(0) + (var(1)+mean(1)**2)*n(1) + ...) / new_n - new_mean**2

where n(0) is the number of runs in the first data set, n(1) is the number in the second, and so on; mean(i) and var(i) are the mean and variance of data set i (variance is just the standard deviation squared); x**2 means "x squared".

Getting the combined variance relies on the fact that the variance of a data set is equal to the mean of the square of the data set minus the square of the mean of the data set. In statistical language,

Var(X) = E(X^2) - E(X)^2

The var(i) + mean(i)**2 terms above give us the E(X^2) portion for each data set, which we can then combine across data sets to get the desired result.
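For concreteness, here is a minimal Python sketch of those formulas. The function name and the (mean, variance, count) tuple layout are just illustrative assumptions, not anything from the question:

def merge_mean_var(summaries):
    # summaries: list of (mean, variance, n) tuples, one per data set
    new_n = sum(n for _, _, n in summaries)
    new_mean = sum(m * n for m, _, n in summaries) / new_n
    # var + mean**2 is E(X^2) for each set; combine, then subtract new_mean**2
    e_x2 = sum((v + m**2) * n for m, v, n in summaries) / new_n
    new_var = e_x2 - new_mean**2
    return new_mean, new_var, new_n

# e.g. two sessions, each summarised as (mean, variance, runs)
combined = merge_mean_var([(10.0, 4.0, 200), (12.0, 9.0, 300)])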

In terms of medians:

If you are combining exactly two data sets, then you can be certain that the combined median lies somewhere between the two medians (or is equal to one of them), but there is little more that you can say. Taking the average of the two medians should be fine, unless you need the combined median to be an actual data point.

If you are combining many data sets in one go, you can either take the median of the medians or take their average. If there may be significant systematic differences between the data sets, then taking their average is probably better, as taking the median of the medians reduces the effect of outliers. But if you do have systematic differences between runs, disregarding them is probably not a good thing to do.
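Both options are one-liners with Python's statistics module, if you want to compare them; the values below are made up purely for illustration:

from statistics import mean, median

run_medians = [3.1, 2.9, 3.0, 7.4]       # medians of several runs (made-up values)
avg_of_medians = mean(run_medians)       # keeps the influence of the outlier run
median_of_medians = median(run_medians)  # damps the outlier run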

Artelius
+3  A: 

The median is not possible. Say you have two data sets, (1, 1, 1, 2) and (0, 0, 2, 3, 3). Their medians are 1 and 2, and the overall median is 1, but there is no way to tell that from the summaries alone.
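To make the "no way to tell" part concrete, here is a small check (using Python's statistics module) with a second, hypothetical pair of sets that has the same sizes and the same medians but a different combined median:

from statistics import median

a, b = [1, 1, 1, 2], [0, 0, 2, 3, 3]
c, d = [1, 1, 1, 2], [2, 2, 2, 3, 3]    # same sizes and same medians as a, b

print(median(a), median(b), median(a + b))  # 1.0 2 1
print(median(c), median(d), median(c + d))  # 1.0 2 2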

bayer
+10  A: 

Artelius is mathematically right, but the way he suggests computing the variance is numerically unstable. You want to compute the variance as follows:

new_var = (n(0)*(var(0) + (mean(0) - new_mean)**2) + n(1)*(var(1) + (mean(1) - new_mean)**2) + ...) / new_n
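A sketch of that computation in Python, assuming the same (mean, variance, count) summaries as above (names are illustrative):

def merge_mean_var_stable(summaries):
    # summaries: list of (mean, variance, n) tuples, one per data set
    new_n = sum(n for _, _, n in summaries)
    new_mean = sum(m * n for m, _, n in summaries) / new_n
    # each set contributes its own variance plus the squared distance of its
    # mean from the combined mean, weighted by its sample size
    new_var = sum(n * (v + (m - new_mean)**2) for m, v, n in summaries) / new_n
    return new_mean, new_var, new_n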

comingstorm
Good point, but could you expand on it a bit?
Artelius
Sure. The problem with the original code is, if your deviation is small compared to your mean, you will end up subtracting a large number from a large number to get a relatively small number, which will cause you to lose floating point precision. The new code avoids this problem; rather than convert to E(X^2) and back, it just adds all the contributions to the total variance together, properly weighted according to their sample size.
comingstorm
+1 for your answer and comment. Both are spot on, and very well written.
duffymo