views: 161

answers: 5

(Not strictly programming, but a question that programmers need answered.)

I have a benchmark, X, which is made up of a lot of sub-benchmarks x1..xn. It's quite a noisy test, with the results being quite variable. To benchmark accurately, I must reduce that "variability", which requires that I first measure it.

I can easily calculate the variability of each sub-benchmark, using, say, the standard deviation or variance. However, I'd like a single number that represents the overall variability of the entire benchmark.

My own attempt at the problem is:

sum = 0
foreach i in 1..n
   mean[i] = mean across the 60 runs of x_i
   foreach j in 1..60
       sum += abs(mean[i] - x_i[j])
variability = sum / (n * 60)
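In numpy terms, my attempt looks roughly like this (a sketch; the random data is just a stand-in for the real measurements, with n=26 sub-benchmarks of 60 runs each):

import numpy as np

x = np.random.rand(26, 60)               # placeholder: n=26 sub-benchmarks, 60 runs each
means = x.mean(axis=1, keepdims=True)    # mean of each sub-benchmark across its 60 runs
variability = np.abs(x - means).mean()   # mean absolute deviation over all n * 60 runs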
+1  A: 

I think you're misunderstanding the standard deviation -- if you run your test 50 times and get 50 different runtimes, the standard deviation is a single number that describes how tightly or loosely those 50 numbers are distributed around your average. In conjunction with your average run time, the standard deviation helps you see how much spread there is in your results.

Consider the following run times:

12 15 16 18 19 21 12 14

The mean of these run times is 15.875. The sample standard deviation of this set is 3.27. There's a good explanation of what 3.27 actually means (in a normally distributed population, roughly 68% of the samples will fall within one standard deviation of the mean: e.g., between 15.875-3.27 and 15.875+3.27), but I think you're just looking for a way to quantify how 'tight' or 'spread out' the results are around your mean.

Now consider a different set of run times (say, after you compiled all your tests with -O2):

14 16 14 17 19 21 12 14

The mean of these run times is also 15.875. The sample standard deviation of this set is 3.0. (So, roughly 68% of the samples will fall between 15.875-3.0 and 15.875+3.0.) This set is more closely grouped than the first set.

And you have a single number that summarizes how compact or loose a group of numbers is around the mean.
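For instance, numpy computes both numbers directly (ddof=1 selects the sample standard deviation; the values in the comments are for the two sets above):

import numpy as np

set1 = np.array([12, 15, 16, 18, 19, 21, 12, 14])
set2 = np.array([14, 16, 14, 17, 19, 21, 12, 14])
print(set1.mean(), np.std(set1, ddof=1))  # 15.875  3.27
print(set2.mean(), np.std(set2, ddof=1))  # 15.875  3.0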

Caveats

The 68% interpretation above rests on the assumption of a normal distribution -- but your run times may not be normally distributed, so please be aware that the standard deviation may be a rough guideline at best. Plot your run times in a histogram to see whether your data looks roughly normal or uniform or multimodal or...

Also, I'm using the sample standard deviation because these runs are only a sample of the population of all possible benchmark runs. I'm not a professional statistician, so even this basic assumption may be wrong. Either the population standard deviation or the sample standard deviation will give you good enough results in your application, as long as you stick consistently to one or the other. Don't mix the two.
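In numpy, for example, the default np.std is the population form; a sketch of the difference on the first set of run times:

import numpy as np

times = np.array([12, 15, 16, 18, 19, 21, 12, 14])
print(np.std(times))           # population std dev (divide by n):   3.06
print(np.std(times, ddof=1))   # sample std dev (divide by n-1):     3.27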

I mentioned that the standard deviation in conjunction with the mean will help you understand your data: if the standard deviation is almost as large as your mean, or worse, larger, then your data is very dispersed, and perhaps your process is not very repeatable. Interpreting a 3% speedup in the face of a large standard deviation is nearly useless, as you've recognized. And the best judge (in my experience) of the magnitude of the standard deviation is the magnitude of the average. (The ratio of standard deviation to mean is known as the coefficient of variation.)

Last note: yes, you can calculate standard deviation by hand, but it gets tedious after the first ten numbers or so. Best to use a spreadsheet, Wolfram Alpha, or your handy high-school calculator.

sarnold
I do understand the standard deviation. However, my problem is that I have 26 distributions, and want to combine them into a single handy number. I explained the full problem because I don't have a deep enough understanding of stats to know whether combining a set of standard deviations has any meaning.
Paul Biggar
@Paul, Okay, now I get it. I thought that you were unsure how to convert the times of multiple runs of *X* into a measure of variability of *X*. But you want to combine the variability of dozens of sub-tests into something more meaningful than the variability of *X*. And _that_ is definitely beyond my stats too.
sarnold
OK, thanks. I rewrote the question to more carefully describe what I needed. Thanks for the help.
Paul Biggar
A: 

You are trying to solve the wrong problem: rather than measuring the variability, it is better to minimize it. The differences can be caused by caching.

Try running the code on a single (same) core, using the SetThreadAffinityMask() function on Windows (see the sketch below).

Drop the first measurement.

Increase the thread priority.

Disable hyperthreading.

If you have many conditional jumps, they can introduce visible timing differences between calls with different input. (This could be addressed by giving exactly the same input for the i-th iteration, and then comparing the measured times between these iterations.)

You can find some useful hints here: http://www.agner.org/optimize/optimizing_cpp.pdf
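For example, the affinity and priority hints might be sketched in Python via ctypes (Windows only; on Linux, os.sched_setaffinity does the pinning):

import ctypes

kernel32 = ctypes.windll.kernel32
thread = kernel32.GetCurrentThread()        # pseudo-handle for the current thread
kernel32.SetThreadAffinityMask(thread, 1)   # pin to CPU 0
kernel32.SetThreadPriority(thread, 2)       # 2 == THREAD_PRIORITY_HIGHEST

# ... run the benchmark here, dropping the first measurement ...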

ruslik
How can I minimize it if I can't measure it?
Paul Biggar
I mean the variability. As I understood it, you want to be able to see small improvements in speed, but because of the large variance of the measurements you cannot (noise >> signal strength). So you've decided to find the best way to measure the average. My point was that you'd do better to use these hints to minimize the variability (the noise), so that you won't need sophisticated ways to compute the average.
ruslik
No, I'm trying to measure the variability, so that I can reduce it.
Paul Biggar
All the hints above are ways to reduce the variability.
ruslik
Right, but I need to measure it to know if my attempts to reduce it are successful.
Paul Biggar
+2  A: 

Best idea: ask at the statistics Stack Exchange once it hits public beta (in a week).

In the meantime: you might actually be more interested in the extremes of variability, rather than the central tendency (mean, etc.). For many applications, I imagine there's relatively little to be gained by improving the typical user experience, but much to be gained by improving the worst user experiences. Try the 95th percentile of the standard deviations and work on reducing that. Alternatively, if the typical variability is what you want to reduce, plot the standard deviations all together. If they're approximately normally distributed, I don't know of any reason why you couldn't just take their mean.
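A quick numpy sketch of both suggestions (the standard deviations here are hypothetical placeholders, one per sub-benchmark):

import numpy as np

stddevs = np.array([3.2, 4.1, 2.8, 9.5, 3.0, 12.7])  # hypothetical per-sub-benchmark std devs
print(np.percentile(stddevs, 95))   # 95th percentile -- the figure to drive down
print(stddevs.mean())               # mean of the std devs, if they look roughly normal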

Matt Parker
Looks like that is the best idea. Here's the question: http://stats.stackexchange.com/questions/1895/determining-the-variability-of-a-benchmark
Paul Biggar
+1  A: 

From the Wikipedia article on Variance: "the variance of the total group is equal to the mean of the variances of the subgroups, plus the variance of the means of the subgroups." I had to read that several times, then run it: 464 from this formula == 464, the standard deviation of all the data -- the single number you want.

#!/usr/bin/env python
import sys
import numpy as np

N = 10
exec("\n".join(sys.argv[1:]))  # override defaults from the command line: this.py N= ...
np.set_printoptions(1, threshold=100, suppress=True)  # .1f
np.random.seed(1)

data = np.random.exponential(size=(N, 60)) ** 5  # N rows, 60 cols
row_avs = np.mean(data, axis=-1)   # average of each row
row_devs = np.std(data, axis=-1)   # spread, stddev, of each row about its average
print("row averages:", row_avs)
print("row devs:", row_devs)
print("average row dev: %.3g" % np.mean(row_devs))

# http://en.wikipedia.org/wiki/Variance:
# variance of the total group
# = mean of the variances of the subgroups  +  variance of the means of the subgroups
avvar = np.mean(row_devs ** 2)
varavs = np.var(row_avs)
print("sqrt total variance: %.3g = sqrt( av var %.3g + var avs %.3g )" % (
    np.sqrt(avvar + varavs), avvar, varavs))

var_all = np.var(data)  # std^2 of all N x 60 values about the average of the lot
print("sqrt variance all: %.3g" % np.sqrt(var_all))

row averages: [  49.6  151.4   58.1   35.7   59.7   48.   115.6   69.4  148.1   25. ]
row devs: [ 244.7  932.1  251.5   76.9  201.1  280.   513.7  295.9  798.9  159.3]
average row dev: 375
sqrt total variance: 464 = sqrt( av var 2.13e+05 + var avs 1.88e+03 )
sqrt variance all: 464


To see how the group variance grows, work through the men/women example from the Wikipedia Variance article. Say we have

60 men of heights 180 +- 10: exactly 30 at 170 and 30 at 190  
60 women of heights 160 +- 7: 30 at 153 and 30 at 167.  

The average standard dev is (10 + 7) / 2 = 8.5 . Together though, the heights

-------|||----------|||-|||-----------------|||---
       153          167 170                 190

spread like 170 +- 13.2, much greater than 170 +- 8.5.
Why? Because we have not only the spreads within the groups, men +- 10 and women +- 7, but also the spread of the group means, 160 and 180, about the common mean 170.
Exercise: compute the spread 13.2 in two ways, from the formula above and directly (a check is sketched below).
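A minimal numpy check of that exercise, both ways:

import numpy as np

men   = np.array([170] * 30 + [190] * 30)   # 60 men: mean 180, std 10
women = np.array([153] * 30 + [167] * 30)   # 60 women: mean 160, std 7
both  = np.concatenate([men, women])

# directly: std of all 120 heights about the common mean 170
print(np.std(both))                                # 13.2

# via the formula: mean of the group variances + variance of the group means
avvar  = np.mean([np.var(men), np.var(women)])     # (100 + 49) / 2 = 74.5
varavs = np.var([men.mean(), women.mean()])        # means 180, 160 -> variance 100
print(np.sqrt(avvar + varavs))                     # sqrt(174.5) = 13.2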

Denis
Thanks, great answer. I took a look at the Wikipedia page, and I'm not sure. That quote is gated by: "Suppose that the observations can be partitioned into equal-sized subgroups according to some second variable." Is that what I have with benchmarks? Is each "sub-benchmark" a subgroup in this sense? I would think the Bienaymé formula may be closer: "the variance of the sum of uncorrelated random variables is the sum of their variances". So the question is whether they are independent. I would think yes, based on an inability to predict one sub-benchmark's results from another's. Thoughts?
Paul Biggar
@Paul, yes, you have subgroups, as in the men/women example (improved a bit). I'd suggest plotting your data, N horizontal lines with 60 dots each, to *see* how they scatter; pencil and paper, with some of the data, would suffice. On "variance of the sum ...": we're not adding numbers anywhere? (And if the answer is useful, click "accept".)
Denis
A: 

This is a tricky problem because benchmarks can have different natural lengths anyway. So the first thing you need to do is convert each of the individual sub-benchmark figures into scale-invariant values (e.g., a "speed-up factor" relative to some believed-good baseline) so that you at least have a chance of comparing different benchmarks.

Then you need to pick a way to combine the figures: some sort of average. There are, however, many types of average. We can reject the mode and the median here; they throw away too much relevant information. But the different kinds of mean are useful because of the different weights they give to outliers. I used to know (but have forgotten) whether it was the geometric mean or the harmonic mean that was most useful in practice (the arithmetic mean is less good here). The geometric mean is basically an arithmetic mean in the log domain, and the harmonic mean is similarly an arithmetic mean in the reciprocal domain; see the sketch below. (Spreadsheets make this trivial.)
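To illustrate the log-domain and reciprocal-domain point (the speed-up factors below are made up for illustration):

import numpy as np

speedups = np.array([1.10, 0.95, 1.30, 1.02, 0.88])  # hypothetical per-sub-benchmark speed-ups

arith = speedups.mean()
geom  = np.exp(np.log(speedups).mean())   # arithmetic mean in the log domain
harm  = 1.0 / (1.0 / speedups).mean()     # arithmetic mean in the reciprocal domain
print(arith, geom, harm)                  # arithmetic >= geometric >= harmonic; the latter two damp large outliers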

Now that you have a means to combine the values for a run of the benchmark suite into something suitably informative, you've then got to do lots of runs. You might want to have the computer do that while you get on with some other task. :-) Then try combining the values in various ways. In particular, look at the variance of the individual sub-benchmarks and the variance of the combined benchmark number. Also consider doing some of the analyses in the log and reciprocal domains.

Be aware that this is a slow business that is difficult to get right, and it's usually uninformative to boot. A benchmark only tests the performance of exactly what's in the benchmark, and that's mostly not how people use the code. It's probably best to strictly time-box your benchmarking work and instead focus on whether users perceive the software as fast enough, or whether the required transaction rates are actually attained in deployment (there are many non-programming ways to screw things up).

Good luck!

Donal Fellows