I think you're misunderstanding the standard deviation -- if you run your test 50 times and have 50 different runtimes the standard deviation will be a single number that describes how tight or loose those 50 numbers are distributed around your average. In conjunction with your average run time, the standard deviation will help you see how much spread there is in your results.
Consider the following run times:
12 15 16 18 19 21 12 14
The mean of these run times is 15.875
. The sample standard deviation of this set is 3.27. There's a good explanation of what 3.27 actually means (in a normally distributed population, roughly 68% of the samples will fall within one standard deviation of the mean: e.g., between 15.875-3.27
and 15.875+3.27
) but I think you're just looking for a way to quantify how 'tight' or 'spread out' the results are around your mean.
Now consider a different set of run times (say, after you compiled all your tests with -O2
):
14 16 14 17 19 21 12 14
The mean of these run times is also 15.875
. The sample standard deviation of this set is 3.0. (So, roughly 68% of the samples will fall within 15.875-3.0
and 15.875+3.0
.) This set is more closely grouped than the first set.
And you have a single number that summarizes how compact or loose a group of numbers is around the mean.
Caveats
Standard deviation is built on the assumption of a normal distribution -- but your application may not be normally distributed, so please be aware that standard deviation may be a rough guideline at best. Plot your run-times in a histogram to see if your data looks roughly normal or uniform or multimodal or...
Also, I'm using the sample standard deviation because these are only a sample out of the population space of benchmark runs. I'm not a professional statistician, so even this basic assumption may be wrong. Either population standard deviation or sample standard deviation will give you good enough results in your application IFF you stick to either sample or population. Don't mix the two.
I mentioned that the standard deviation in conjunction with the mean will help you understand your data: if the standard deviation is almost as large as your mean, or worse, larger, then your data is very dispersed, and perhaps your process is not very repeatable. Interpreting a 3%
speedup in the face of a large standard deviation is nearly useless, as you've recognized. And the best judge (in my experience) of the magnitude of the standard deviation is the magnitude of the average.
Last note: yes, you can calculate standard deviation by hand, but it is tedious after the first ten or so. Best to use a spreadsheet or wolfram alpha or your handy high-school calculator.