views: 152
answers: 5

I have a series of functions that are all designed to do the same thing. The same inputs produce the same outputs, but the time that it takes to do them varies by function. I want to determine which one is 'fastest', and I want to have some confidence that my measurement is 'statistically significant'.

Perusing Wikipedia and the interwebs tells me that statistical significance means that a measurement or group of measurements is different from a null hypothesis by a p-value threshold. How would that apply here? What is the null hypothesis between function A being faster than function B?

Once I've got that whole setup defined, how do I figure out when to stop measuring? I'll typically see that a benchmark is run three times, and then the average is reported; why three times and not five or seven? According to this page on Statistical Significance (which I freely admit I do not understand fully), Fisher used 8 as the number of samples that he needed to measure something with 98% confidence; why 8?

A: 

The fundamental question you're trying to answer is: how likely is it that what you observe could have happened by chance? Is this coin fair? Throw it once: HEADS. No, it's not fair, it always comes down heads. Bad conclusion! Throw it 10 times and get 7 heads; now what do you conclude? 1000 times and 700 heads?

For simple cases we can imagine how to figure out when to stop testing. But you have a slightly different situation - are you really doing a statistical analysis?

How much control do you have over your tests? Does repeating them add any value? Your computer is deterministic (maybe). Einstein's definition of insanity is to repeat something and expect a different outcome. So when you run your tests, do you get repeatable answers? I'm not sure that statistical analyses help if you are doing good enough tests.

For what you're doing, I would say that the first key thing is to make sure that you really are measuring what you think you are. Run every test for long enough that any startup or shutdown effects are hidden. Useful performance tests tend to run for quite extended periods for that reason. Make sure that you are not actually measuring the time spent in your test harness rather than the time spent in your code.

You have two primary variables: how many iterations of your method to run in one test, and how many tests to run.

Wikipedia says this:

In addition to expressing the variability of a population, standard deviation is commonly used to measure confidence in statistical conclusions. For example, the margin of error in polling data is determined by calculating the expected standard deviation in the results if the same poll were to be conducted multiple times. The reported margin of error is typically about twice the standard deviation.

Hence, if your objective is to be sure that one function is faster than another, you could run a number of tests of each and compute the means and standard deviations. My expectation is that if the number of iterations within any one test is high, then the standard deviation is going to be low.

If we accept that definition of margin of error, you can see whether the two means are further apart than their combined margins of error.
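As a minimal Python sketch of that procedure (func_a and func_b are placeholders for your own functions, and "margin of error" is read here as roughly twice the standard error of the mean across tests):

    import statistics
    import time

    def time_one_test(func, iterations=100_000):
        """Run one test: many iterations of func, returning seconds per call."""
        start = time.perf_counter()
        for _ in range(iterations):
            func()
        return (time.perf_counter() - start) / iterations

    def compare(func_a, func_b, tests=30):
        """Run several tests per function and compare means +/- ~2 standard errors."""
        a = [time_one_test(func_a) for _ in range(tests)]
        b = [time_one_test(func_b) for _ in range(tests)]
        mean_a, mean_b = statistics.mean(a), statistics.mean(b)
        # Margin of error taken as ~2x the standard error of the mean.
        moe_a = 2 * statistics.stdev(a) / tests ** 0.5
        moe_b = 2 * statistics.stdev(b) / tests ** 0.5
        if abs(mean_a - mean_b) > moe_a + moe_b:
            winner = "A" if mean_a < mean_b else "B"
            print(f"{winner} looks faster: {mean_a:.3e}s vs {mean_b:.3e}s per call")
        else:
            print("The margins of error overlap -- no clear winner yet")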

djna
Every time I run a test, I get a slightly different number for the speed. That's just the way benchmarking goes-- a modern computer environment is not controlled, as someone else already pointed out. So, running the same test twice will give different speed answers, but not different results.
mmr
I've updated the answer to suggest looking at standard deviation.
djna
+1  A: 

Do you really care about statistical significance or plain old significance? Ultimately you're likely to have to form a judgement about readability vs performance - and statistical significance isn't really going to help you there.

A couple of rules of thumb I use:

  • Where possible, test for enough time to make you confident that little blips (like something else interrupting your test for a short time) won't make much difference. Usually I reckon 30 seconds is enough for this, although it depends on your app. The longer you test for, the more reliable the test will be - but obviously your results will be delayed :) (see the sketch after this list)

  • Running a test multiple times can be useful, but if you're timing for long enough then it's not as important, IMO. It can alleviate other forms of error that make a whole test take longer than it should. If a test result looks suspicious, certainly run it again. If you see significantly different results for different runs, run it several more times and try to spot a pattern.
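A minimal Python sketch of the first rule of thumb, treating the 30-second figure as just a default and my_function as a placeholder for your own code:

    import time

    def benchmark(func, min_seconds=30.0):
        """Keep calling func until at least min_seconds of wall-clock time has
        passed, then report the average time per call over the whole run."""
        calls = 0
        start = time.perf_counter()
        while True:
            func()
            calls += 1
            elapsed = time.perf_counter() - start
            if elapsed >= min_seconds:
                return elapsed / calls

    # avg_seconds = benchmark(my_function)   # my_function stands in for your code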

Jon Skeet
I really care about statistical significance; I want to state, with numerical confidence, that this function or approach is faster than that approach on a given computer setup. For your tests, why 30 seconds? Where does that number come from? Intuition based on experience? And you say 'run it several more times'-- how many more times? Is there some formula, or just a back-of-the-envelope calculation?
mmr
If the difference isn't *absolutely obvious* then just go with the more readable version. That's almost *always* the best way to go. As for 30 seconds - yes, intuition based on experience. As for "how many more times" - until you've got a set of figures which look reasonable. All just based on gut feeling.
Jon Skeet
+1 about absolutely obvious differences. In statistics, we called this the "inter-ocular percussion test" (aka "it hits you right between the eyes").
MusiGenesis
A: 

The research you cite sounds more like a highly controlled environment. This is purely a practical answer that has proven itself time and again to be effective for performance testing.

If you are benchmarking code in a modern, multi-tasking, multi-core, computing environment, the number of iterations required to achieve a useful benchmark goes up as the length of time of the operation to be measured goes down.

So, if you have an operation that takes ~5 seconds, you'll typically want 10 to 20 iterations. As long as the deviation across the iterations remains fairly constant, your data is sound enough to draw conclusions. You'll often want to throw out the first iteration or two because the system is typically warming up caches, etc.

If you are testing something in the millisecond range, you'll want 10s of thousands of iterations. This will eliminate noise caused by other processes, etc, firing up.

Once you hit the sub-millisecond range -- 10s of nanoseconds -- you'll want millions of iterations.

Not exactly scientific, but neither is testing "in the real world" on a modern computing system.

When comparing the results, consider the difference in execution speed as a percentage, not an absolute. Anything less than about a 5% difference is pretty close to noise.
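A minimal Python sketch of this recipe, with func_a and func_b as placeholders, the first couple of warm-up iterations discarded, and the comparison expressed as a percentage:

    import statistics
    import time

    def run_trials(func, iterations=20, warmup=2):
        """Time each iteration separately, then drop the first few warm-up runs."""
        times = []
        for _ in range(iterations):
            start = time.perf_counter()
            func()
            times.append(time.perf_counter() - start)
        return times[warmup:]

    def percent_difference(func_a, func_b):
        mean_a = statistics.mean(run_trials(func_a))
        mean_b = statistics.mean(run_trials(func_b))
        diff = (mean_b - mean_a) / mean_a * 100
        if abs(diff) < 5:          # within ~5%: treat as noise, per the answer above
            return "within noise"
        return f"B differs from A by {diff:+.1f}%"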

bbum
but why 10 to 20? Where are those numbers coming from? Is there some formula, or are you just guessing? Why 5% for noise, and not, say, something related to the standard deviations of the speed?
mmr
+4  A: 

I would not bother applying statistics principles to benchmarking results. In general, the term "statistical significance" refers to the likelihood that your results were achieved accidentally, and do not represent an accurate assessment of the true values. In statistics, as a result of simple probability, the likelihood of a result being achieved by chance decreases as the number of measurements increases. In the benchmarking of computer code, it is a trivial matter to increase the number of trials (the "n" in statistics) so that the likelihood of an accidental result is below any arbitrary threshold you care to define (the "alpha" or level of statistical significance).

To simplify: benchmark by running your code a huge number of times, and don't worry about statistical measurements.
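In Python, for instance, the standard-library timeit module does essentially this; a minimal sketch, assuming hypothetical functions func_a and func_b importable from a module called mymodule:

    import timeit

    # timeit runs each statement a large number of times and reports the total;
    # taking the best of several repeats filters out interference from other work.
    best_a = min(timeit.repeat("func_a()", setup="from mymodule import func_a",
                               number=1_000_000, repeat=5))
    best_b = min(timeit.repeat("func_b()", setup="from mymodule import func_b",
                               number=1_000_000, repeat=5))
    print(f"best of 5 x 1,000,000 calls: A={best_a:.3f}s  B={best_b:.3f}s")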

Note to potential down-voters of this answer: this answer is somewhat of a simplification of the matter, designed to illustrate the concepts in an accessible way. Comments like "you clearly don't understand statistics" will result in a savage beat-down. Remember to be polite.

MusiGenesis
But by how much does the "likelihood of a result being achieved by chance decreases as the number of measurements increases"? Where's the acceptable point of saying, OK, the likelihood is pretty low now-- call it 0.5% chance that my results are good? Or that this routine is X% faster than another routine, and my confidence in X% is 99%?
mmr
@mmr: if you have two routines (A and B), and you run them each 1,000,000 times, and the average time for A is 1 ms and the average time for B is 2ms, then the statistics question is: "given the *assumption* that A and B in reality take the same amount of time, what is the probability that B *purely by chance* measured twice as long as A?" The answer is: "so damn near 0 that you might as well say 0".
MusiGenesis
@mmr: statistical methods in general are designed for approximating measurements of uncountably large populations by measuring relatively small samples and extrapolating from there. In computer code benchmarking, you are not in any practical way limited to measuring small samples, so statistical methodologies are not necessary.
MusiGenesis
@MusiGenesis: Sure, but you've just run it a million times. Maybe you could make the same statement about being 'so damn near zero' at 1000 runs. Where's that threshold? And what's 'so damn near zero'? 0.001?
mmr
@MusiGenesis #2: Why isn't it necessary? Why isn't this like any other scientific measurement?
mmr
@mmr: standard alpha values in statistics are .05 and .01. I'm talking more like .00000001, orders of magnitude below any reasonable alpha value.
MusiGenesis
@mmr: as I mentioned before, statistical methodologies were designed for estimating the properties of populations that are too large to be counted directly. Computer code does not suffer from this problem.
MusiGenesis
@MusiGenesis: Now we're getting to what I want to know :) What's an alpha value? How do you determine it? And as for 'not suffering from this problem', my problem is that these routines take on the order of hours to run (it's some heavy lifting image processing code that I'm trying to optimize), so I can't reasonably run the code a million times and still finish this month.
mmr
@mmr: the *alpha* value is basically how willing you are to make a Type I error, i.e. how willing you are to reject a null hypothesis that is actually true. In my A and B example above, choosing an alpha of .05 would mean that when running comparison tests like this, you're willing to be wrong (in other words, to say B is slower than A when it actually isn't) 5% of the time.
MusiGenesis
@mmr: if each routine takes hours to run, that's sort of a different matter. I'm guessing that your routine actually involves one or more smaller pieces of code that get called over and over again. If this is the case, you're better off setting up benchmark tests of these smaller subroutines, and comparing those.
MusiGenesis
To further complicate this issue, since you want to know generally which methods are better *when run on a variety of different machines/environments,* your "n" must refer to the number of different machines you run it on, in order to properly generalize your results. When run on just your machine, you're really only estimating the relative speeds on that machine.
MusiGenesis
@MusiGenesis: So if there's a Type I error, what's the null hypothesis of comparing two functions? That they are the same speed? Or that A is faster than B? Regardless of the _time_ of execution, I still want to be able to say (for political reasons that are long and boring) that A is faster than B, or vice versa, with some numerical confidence description. When I look at Type I errors on Wikipedia, I still don't get a sense of how many measurements I'd need, which is the ultimate question here.
mmr
@mmr: in this case, the null hypothesis is that the two functions are the same speed. If you then used a one-tailed statistical test, you could say that B is faster than A (with a .01 alpha or whatever), or if you used a two-tailed statistical test, you could say that B was *different* from A (either faster or slower). You would probably only use a one-tailed test if you had an *a priori* reason to think that B would be faster, rather than just different.
MusiGenesis
@mmr: you can see, given the number of choices available to you for something like this, how it is generally the case that the generalizability of statistical results are often severely overstated (not to mention how easily they can be manipulated to produce the desired conclusions).
MusiGenesis
@mmr: one additional point - the formulas for calculating statistical significance always incorporate the *variance* in the measurements. If the variance is high (meaning the measurements are dissimilar, e.g. 2, 1, 7, 25, 8, 50), then it is harder to reject the null hypothesis. If the variance is low (e.g. 2, 2, 2, 3, 2, 2), it is easier to reject the null. In layman's terms, more consistency in the data indicates greater reliability of the estimate.
MusiGenesis
@mmr: this is why you can't simply say you need some arbitrary number of measurements (like 8, for example) to achieve a given confidence level. The necessary n value is dependent upon the sample variance.
MusiGenesis
@mmr: final point (for real this time) - computer benchmark measurements generally have an incredibly low variance. A given set of instructions take almost exactly the same length of time every time they're run. So in addition to computers being able to easily generate a large n, they only *need* a small n.
MusiGenesis
+3  A: 

You are asking two questions:

  1. How do you perform a test of statistical significance that the mean time of function A is greater than the mean time of function B?
  2. If you want a certain confidence in your answer, how many samples should you take?

The most common answer to the first question is that you either want to compute a confidence interval or perform a t-test. It's no different from any other scientific experiment with random variation. To compute the 95% confidence interval of the mean response time for function A, simply take the mean and add 1.96 times the standard error to either side. The standard error is the square root of the variance divided by N. That is,

95% CI = mean +/- 1.96 * sqrt(sigma2 / N)

where sigma2 is the variance of speed for function A and N is the number of runs you used to calculate mean and variance.
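A minimal Python sketch of that interval, with times_a and times_b standing in for whatever per-run timings you have collected:

    import statistics

    def ci95(samples):
        """95% confidence interval for the mean: mean +/- 1.96 * sqrt(variance / N)."""
        n = len(samples)
        mean = statistics.mean(samples)
        half_width = 1.96 * (statistics.variance(samples) / n) ** 0.5
        return mean - half_width, mean + half_width

    # lo_a, hi_a = ci95(times_a)
    # lo_b, hi_b = ci95(times_b)
    # if hi_a < lo_b: print("A is faster, with roughly 95% confidence")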

Your second question relates to statistical power analysis and the design of experiments. You describe a sequential setup where you are asking whether to continue sampling. The design of sequential experiments is actually a very tricky problem in statistics, since in general you are not allowed to calculate confidence intervals or p-values and then draw additional samples conditional on not reaching your desired significance. If you wish to do this, it would be wiser to set up a Bayesian model and calculate your posterior probability that speed A is greater than speed B. This, however, is massive overkill.

In a computing environment it is generally pretty trivial to achieve a very small confidence interval both because drawing large N is easy and because the variance is generally small -- one function obviously wins.

Given that Wikipedia and most online sources are still horrible when it comes to statistics, I recommend buying Introductory Statistics with R. You will learn both the statistics and the tools to apply what you learn.

Tristan
Thanks for the reference! But let's suppose that I've taken N samples already. Could I calculate CI's for any three runs sampled from N, and then any 4, and then any 5, etc, up to N, to determine if I've already hit a threshold by the time I get to N? From what you're saying, "You are not allowed to calculate confidence intervals or p-values..." but if I've already gathered the data, is that OK? Why can't I just check the data I have to see if I need to keep going?
mmr
It's perfectly fine to do whatever you want with some pre-test data. You can play around and see what N gives you the type of confidence interval you desire. However, when you compute your final confidence interval, you can't base it on a sample where N is a function of characteristics of that particular sample (i.e. the mean of the N-1 previous observations). The question to keep in mind is: would the procedure I use to choose N lead to a result equivalent to N random draws? For instance, increasing N until the CI is smaller than 1 will always create a CI less than 1, whereas simply drawing N random samples won't.
Tristan