How to accurately measure clock cycles used by a c++ function?

views:

1252

answers:

+1 Q:

How to accurately measure clock cycles used by a c++ function?

I know that I have to use: rdtsc. The measured function is deterministic but the result is far from being repeatable (I get 5% oscillations from run to run). Possible causes are:

context switching
cache misses

Do you know any other causes? How to eliminate them?

+2 A:

See the question http://stackoverflow.com/questions/410437/is-stopwatch-benchmarking-acceptable for a discourse on the variance of micro-benchmarking on a modern multi-core multi-threading multi-process machine.

Although the question is about Java, the considerations apply to benchmarking in any language.

Also see: http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java

Alse see: http://stackoverflow.com/questions/324200/what-advice-can-you-give-me-for-writing-a-meaningful-benchmark

Software Monkey 2009-02-03 01:39:42

+1 A:

Adding to the list of causes: branch prediction/misprediction (this can be effected by a context switch with complex prediction caches on some chips. Also prediction can be affected by different inputs to your program and so direct comparison between the time for two different data sets can be slightly skewed.

In general, it is nigh impossible to mitigate all of these, but there are some things you can do to help each one:

Cache miss: "Prime" the cache before beginning timing. Don't forget there is an instruction cache which also needs to be primed. For small data sets just run the entire test once without timing, then run it again with timing. For large data sets do this, but then use the processor's precache instruction to load the first data block back in to the cache.
Context switch: Use a multi-processor/core chip on a lightly loaded system and set the process's affinity to a specific CPU (preferably not CPU 0). This will also help with cache misses (since moving CPUs means the cache is completely lost) and branch prediction (since really it is a form of cache).

But, of course, by far the best method of doing timings like this is to do them very many times on very large pieces of data such that the variability introduced by things you can't control is minimized. (It can never really be erased.)

SoapBox 2009-02-03 02:01:05

If you're measuring clock cycles, why would you want to eliminate cache misses and branch mispredictions from your benchmark? Those will occur "in the wild" as well, so it's important to view the entire distribution and not just the optimal case.

Tom 2009-02-22 23:45:00

+1 A:

TSCs (what rdtsc uses) are often not synchronized on multi-processor systems. It may help to set the CPU affinity in order to bind the process to a single CPU.

You could also get timestamps from HPET timers if available, which aren't prone to the same problem.

As for repeatability, those variances are true. You could disable caching, give a realtime priority to the process and/or (if on Linux or something similar) recompile your kernel with a lower, fixed timer interrupt frequency (the one that does time-slicing). You can't eliminate the variance completely, at least not easily and not on regular CPU + OS combos.

In general, for easy coding, reliability and portability reasons, I suggest you use what the OS has to offer. If it offers high-precision timers, use the appropriate OS helper.

(Just in case you're trying a time attack on a crypto system, well, you're going to have to live with 1. this randomness and 2. general defenses that make the system unpredictable for good reasons, so the function might not be deterministic with respect to time.)

EDIT: added paragraph about timers the OS can offer.

EDIT: This refers to Linux. For binding a process to a single CPU (to have an accurate read from RDTSC), you can use sched_setaffinity(2). And here's some code from one of my projects using it for some other purpose (mapping threads to CPUs). This should be your first try. As for HPETs, you can use regular POSIX calls like these, as long as the kernel and the machine support these timers.

Eduard - Gabriel Munteanu 2009-02-03 02:05:25

Thanks. How to bind process to one processor? Do you have any example for HPET in use?

Łukasz Lew 2009-02-06 17:46:20

@Łukasz Lew: I've edited my post and added the details.

Eduard - Gabriel Munteanu 2009-02-07 04:38:49

@Łukasz Lew: Clarified the best approach and HPETs in the last paragraph.

Eduard - Gabriel Munteanu 2009-02-07 04:51:28

+2 A:

Why eliminate them? Sounds like you've created a realistic benchmark. That code would have the same kind of variability when used in the wild as well. Probably worse since you're likely to have eliminated disk and CPU cache latencies. Using Jon Skeet's approach, creating conditions that gives you the best possible result, will only leave you with a result that makes you feel good but is never achievable.

If an absolute number is important, calculate the median, not the average.

Hans Passant 2009-02-03 02:06:19

median seems robust. But I can't ignore the cases when the benchmark takes forever.

Łukasz Lew 2009-02-06 17:42:57

Most modern processors support a remarkable set of low-level hardware performance counters. If you really want to know the answers, including real measurements on cache misses and context switching overhead, grab the PAPI (Performance API) toolkit, then, on some (though not all) OSes install one kernel patch, and, with some additional effort, you're off and running.

Liudvikas Bukys 2009-02-03 02:13:06

ansaurus

tags:

views:

answers:

How to accurately measure clock cycles used by a c++ function?

related questions