views:

565

answers:

4

I created a simple class to benchmark some methods of mine. But is it accurate? I am kind of new to benchmarking, timing, et cetera, so thought I could ask for some feedback here. Also, if it is good, maybe somebody else can make use of it as well :)

public sealed class Benchmark : IEnumerable<long>
{
    private readonly Action subject;
    private Benchmark(Action subject) { this.subject = subject; }

    public static Benchmark This(Action subject)
    {
        return new Benchmark(subject);
    }

    public IEnumerator<long> GetEnumerator()
    {
        var watch = new Stopwatch();
        while (true)
        {
            watch.Reset();
            watch.Start();
            subject();
            watch.Stop();
            yield return watch.ElapsedTicks;
        }
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}

You can use it like this:

var avg = Benchmark.This(() => SomeMethod()).Take(500).Average();

Any feedback? Does it look to be pretty stable and accurate, or have I missed something?

+2  A: 

As I am not a C# programmer, I cannot say with any deal of accuracy whether that class is an appropriate implementation for counting how long a function execution takes. However, there are things to keep in mind for repeatability and accuracy.

I am not up on the various ins and outs of the .NET Framework, but depending on how it compiles to native code, it might be possible that any compilation would affect benchmark results. Also, whether or not a function is in the cache can make a difference, too. So you'll want to loop over your function to ensure that there is no hit from compilation and that everything is loaded and ready. Once that's done, you might be able to get started.

Others will probably have better information and knowledge of .NET than I do.

ZachS
Normally a compiled .Net application isn't a native EXE, but your points are all pretty much valid anyway. An application running in Debug mode in Visual Studio will generally run slower than a .Net EXE started normally, and also functions will generally run slower the first time they're called (especially if they make calls into other Assemblies that have to be loaded), so it makes sense to call a method the first time without including its running time in the average measurement.
MusiGenesis
Good idea, should edit it so that it runs the method once before starting to yield results :)
Svish
+3  A: 

You should definitely return ElapsedMilliseconds instead of ElapsedTicks. The value returned by ElapsedTicks is dependent upon the Stopwatch frequency, which can be different on different systems. It will not necessarily correspond to the Ticks property of a Timespan or DateTime object.

See http://msdn.microsoft.com/en-us/library/system.diagnostics.stopwatch.elapsedticks.aspx.

If you do want the extra resolution of Ticks, you should return watch.Elapsed.Ticks (i.e. Timestamp.Ticks) instead of watch.ElapsedTicks (this might be one of the subtlest potential errors in .Net). From MSDN:

Stopwatch ticks are different from DateTime.Ticks. Each tick in the DateTime.Ticks value represents one 100-nanosecond interval. Each tick in the ElapsedTicks value represents the time interval equal to 1 second divided by the Frequency.

Other than that, I guess your code is fine, although I think you'd be including some of the method-calling overhead in your measurements, which might be significant if the methods themselves take very little time to execute. Also, you probably would want to exclude the first call to the method from your calculated average, but I'm not sure how you'd do that in your class.

One last point, which would probably not be relevant to most uses of this class: Stopwatch runs a bit fast compared to the system time. On my computer, it gets about 5 seconds (that's seconds, not milliseconds) ahead after 24 hours, and on other machines this drift can be even larger. So it's a little misleading to say it's highly accurate, when it's actually just highly granular. For timing short-duration methods, this obviously wouldn't be a significant problem.

And one more last point, which certainly is relevant: I've often noticed while benchmarking that I'll get a bunch of running times that are all clustered within a narrow range of values (e.g. 80, 80, 79, 82 etc.), but occasionally something else will happen in Windows (like opening another program or my anti-virus kicks on or something) and I'll get a value wildly out of whack with the others (e.g. 80, 80, 79, 271, 80 etc.). I think a simple solution to this outlier problem is to use the median of your measurements instead of the mean. I don't know if Linq supports this automatically or not.

MusiGenesis
You have a point about the method-calling overhead. Can't really get away with that here though... I think... Anyways it shouldn't be much and the methods I am using it on now are either so fast that I don't care if it is inaccurate or not, or so long that the method-calling overhead is tiny in comparison :p
Svish
@Svish: the overhead of the method-calling is suprisingly large. To see the difference, trying timing a loop where you multiply two numbers inline vs. passing the two numbers to a method as parameters and multiplying inside that method.
MusiGenesis
+1 for the catch of ElapsedTicks v. Elapsed.Ticks. Also, have you tried using a percentile to weed out the 'edge' cases? e.g. 90% of iterations were at, or below, a threshold. See http://www.techbookreport.com/tutorials/quantiles.html though I cheat and just use Excel. :)
Zach Bonham
@Zach: I learned about ElapsedTicks vs. Elapsed.Ticks the usual way - by getting slapped around on StackOverflow. Personally, I eliminate outliers by employing the "inter-ocular percussion test" (a.k.a. "it hits you right between the eyes"). :)
MusiGenesis
`ElapsedTicks` versus `Elapsed.Ticks` is not an error. They're two different measures that happen to have the same name because of a coincidence. Ticks in a `Stopwatch` context is the number of timer intervals, and is dependent on timer resolution. It is obviously a useful measure to have, as it represents the maximum accuracy you can get. `Ticks` in a `DateTime` context is just a number of 100 nanosecond intervals, which is a commonly used way to represent time intervals on computers.
Joren
@Joren: using ElapsedTicks is not necessarily an error. However, this is a class for benchmarking, so it almost certainly would end up being an error if the user of the class dumped the ElapsedTicks value into a TimeSpan in order to calculate the elapsed duration. I should have said "this might be one of the subtlest *potential* errors in .Net". In fact, I *will* say that.
MusiGenesis
Would it maybe make more sense to just return the Elapsed TimeSpan itself?
Svish
Oh, and also, very good idea to use a median instead of a mean. Not sure if Linq has it, probably not. But shouldn't be hard to create anyways :)
Svish
@Svish: would that work with `yield`? I would just return ElapsedMilliseconds - you're never really going to get sub-millisecond accuracy anyway, no matter how you do this.
MusiGenesis
As long as I changed the return type to TimeSpan I don't see a problem with it. TimeSpan is just a regular struct, as far as I know. And about the accuracy, I don't really need need sub-millisecond accuracy. But I want to see the number. Not because I want to know the exact running time, but because I want to compare it to the running time of other methods. I am using it to compare implementations of the same thing, kind of. If that made sense :p
Svish
+6  A: 

It is about as accurate as you can get for a simple benchmark. But there are some factors not under your control:

  • Load on the system from other processes
  • State of the Heap before/during the benchmark

You could do something about that last point, a benchmark is one of the rare situations where calling GC.Collect() can be defended. And you might call 'subject()' once beforehand to eliminate any JIT issues. But that requires calls to subject() to be independent.

public IEnumerator<TimeSpan> GetEnumerator()
{
    subject();     // warm up
    GC.Collect();  // compact Heap
    GC.WaitForPendingFinalizers(); // and wait for the finalizer queue to empty

    var watch = new Stopwatch();
    while (true)
    {
        watch.Reset();
        watch.Start();
        subject();
        watch.Stop();
        yield return watch.Elapsed;  // TimeSpan
    }
}

For bonus, your class should check the System.Diagnostics.Stopwatch.IsHighResolution property. If it is off you only have a very course (20 ms) resolution.

But on an ordinary PC, with many services running in the background, it is never going to be very accurate.

Henk Holterman
It's not a bad idea to wait for pending finalizers after that collection. Remember, the finalizers run on a different thread; the finalizers for the previous run could be running on another thread while the current run is going.
Eric Lippert
Eric, absolutely. I'll edit.
Henk Holterman
All good ideas. I've implemented them in my class. Only thing I changed was to just return the `TimeSpan` instead of the `ElapsedMilliseconds` or `Ticks` :)
Svish
+7  A: 

Couple problems here.

First, remember that the first time you run the code, the transitive closure of its method calls will be jitted. That means that the first run is likely to have higher cost than every subsequent run. Depending on whether you are benchmarking "cold" timings or "hot" timings, this could make a difference. I have seen methods where the cost of jitting the method was higher than every other call to it put together!

Second, remember that the garbage collector runs on another thread. If you are making garbage in one run, then the cost of cleaning up that garbage might not be realized until suebsequent runs. You are therefore failing to account for the total cost of one run, by foisting it off onto later runs.

Both of these are indicative of the weakness of all benchmarking: benchmarking is by nature unrealistic, and therefore of limited value. In real-world code, the GC is going to be running, the jitter is going to be running, and so on. It is frequently the case that benchmarked performance is nothing at all like real-world performance because the benchmark does not take into account the variability of real-world costs inherent in a large system. Rather than analyzing perf characteristics in isolation, I prefer to look at the perf characteristics of realistic scenarios actually faced by real customers.

Eric Lippert