Hi all,

I am trying to benchmark a piece of software that runs on an Intel Pentium under Linux. The problem is that I get considerable performance variations between consecutive test runs when measuring with the RDTSC instruction. Runtimes of exactly the same piece of software vary between 5 million and 10 million clock cycles, so in the worst case I have an overhead of 100%. I am aware that cache contention causes some variation, but is there a way to eliminate other potential sources such as interrupts and other processes?

I would be thankful for any useful tips on how to do this properly.

Many thanks, Kenny

A: 

Some general things: raise the test process priority (man 1 nice), stop as many other processes as possible, unload unused kernel modules, flush disk caches (so that background kernel threads have less work), or reboot into single-user mode.
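
The priority change can also be made from inside the benchmark itself. The following is only a sketch, assuming Linux and the POSIX setpriority()/getpriority() calls; the helper names raise_priority and current_nice are made up for illustration, and a negative nice value will normally fail unless the process runs as root (or has CAP_SYS_NICE).

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/* Hypothetical helper: ask for the given nice value for the calling
   process (-20 is the most favourable, 19 the least).
   Returns 0 on success, -1 on failure (typically a permission error). */
static int raise_priority(int nice_value)
{
    if (setpriority(PRIO_PROCESS, 0, nice_value) == -1) {
        fprintf(stderr, "setpriority(%d) failed: %s\n",
                nice_value, strerror(errno));
        return -1;
    }
    return 0;
}

/* Hypothetical helper: report the nice value the process currently has. */
static int current_nice(void)
{
    return getpriority(PRIO_PROCESS, 0);
}
```

Calling raise_priority(-20) once at the start of main(), before the first timestamp is taken, should be enough; if it fails, the benchmark still runs, just at the default priority.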

ygrek
+2  A: 

Common problems in this general area are:

  • process migration in multi-CPU/multi-core systems
  • RDTSC not consistent across cores in multi-CPU/multi-core systems
  • other processes taking CPU time (also interrupts, I/O, screen activity, etc)
  • automatic CPU clock frequency scaling
  • VM page faults etc

Solutions:

  • If you're running a single-threaded process on a multi-CPU/multi-core system then use CPU affinity to lock the process to a specific core. (Use taskset from the command line or call sched_setaffinity() from within your code.)

  • Make sure you have no other processes taking CPU time, disable screen savers or other desktop animations and make sure there are no screen updates while your code is running. Also, don't print (e.g. with printf) to a GUI console window while your code is being timed; save any results output until after you've collected your last timestamp. (If possible you could even consider killing the GUI completely.)

  • Use a more reliable timing method than RDTSC (I typically use clock_gettime(CLOCK_PROCESS_CPUTIME_ID, ...) on Linux).

  • Disable automatic clock frequency scaling (e.g. Linux: cpufreq-set)

  • Run your code in a loop, for say N repeats, preferably re-using the same memory allocations for any large data structures (to get rid of the effects of VM page faults etc). Ignore the first measurement and average the remaining N - 1 measurements.
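
Several of the points above (CPU affinity, clock_gettime() instead of RDTSC, repeating the run and discarding the first measurement) can be combined in a few lines. This is a rough sketch rather than a complete harness, assuming Linux with glibc. The names pin_to_cpu, cpu_time_ns, work and mean_ns_skip_first are hypothetical, and work() is only a stand-in for the code actually being benchmarked.

```c
#define _GNU_SOURCE            /* needed for sched_setaffinity() and CPU_SET */
#include <assert.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

/* Pin the calling process to a single core so it cannot migrate
   between cores during the measurement. Returns 0 or -1. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* CPU time consumed by this process so far, in nanoseconds. */
static long long cpu_time_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Placeholder workload (an assumption, replace with the real code).
   The volatile accumulator keeps the compiler from removing the loop. */
static void work(void)
{
    volatile unsigned long acc = 0;
    for (unsigned long i = 0; i < 1000000UL; i++)
        acc += i * i;
}

/* Run fn n times, ignore the first (warm-up) run and return the
   mean CPU time of the remaining n - 1 runs in nanoseconds. */
static double mean_ns_skip_first(void (*fn)(void), int n)
{
    long long total = 0;
    assert(n >= 2);
    for (int i = 0; i < n; i++) {
        long long t0 = cpu_time_ns();
        fn();
        long long t1 = cpu_time_ns();
        if (i > 0)             /* discard the first measurement */
            total += t1 - t0;
    }
    return (double)total / (n - 1);
}
```

A typical use would be to call pin_to_cpu(0) once at start-up (core 0 is an arbitrary choice), then mean_ns_skip_first(work, 11), and to print the result only after the last timestamp has been taken. On older glibc versions clock_gettime() may need -lrt at link time.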

Paul R
I am aware of the RDTSC problem with multiple cores! To rule it out, I disabled one of the cores at boot so that it does not affect my measurements. I have pretty much considered all the other things as well. Thanks for your help.
Kenny
A: 

The best way to reduce variations caused by the system environment would be to run your benchmark in "single user" mode, also known as runlevel 1 or "recovery mode".

You can boot into this mode by passing "-s" as a boot time option to the kernel, or you can switch a running system to it with "init 1".

In this mode, all daemons are stopped, and you are logged in as root. Pretty much anything that runs on the system runs from your interactive terminal.

ddaa
That sounds good, I will give it a go!
Kenny
Tried it, unfortunately the variations still remain in place.
Kenny
A: 

Please make sure you deactivate frequency scaling in the BIOS and the operating system. Also it sounds like you are using a P4, so make sure you turn off hyperthreading.

I have encountered performance variations like you describe in the past, due to such things.

This page describes how to turn it on, which should give you what you need to turn it off.

You will also need to reboot your machine and look in the BIOS settings to determine whether frequency scaling is being applied automatically, without the operating system knowing.
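
On the OS side, one way to check what the kernel is doing is to read the cpufreq governor from sysfs before the benchmark starts. This is a sketch only: it assumes the usual Linux location /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor, the helper names read_governor and governor_is_performance are made up, and it cannot see scaling that the BIOS applies behind the operating system's back.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read the cpufreq governor name for one core.
   Returns 0 on success, -1 if the sysfs file is missing or unreadable
   (no cpufreq driver loaded, virtual machine, etc.). */
static int read_governor(int cpu, char *buf, size_t len)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */
    return 0;
}

/* Returns 1 if the core uses the "performance" governor (clock held at
   its maximum), 0 for any other governor, -1 if it cannot be determined. */
static int governor_is_performance(int cpu)
{
    char gov[64];
    if (read_governor(cpu, gov, sizeof(gov)) != 0)
        return -1;
    return strcmp(gov, "performance") == 0;
}
```

If governor_is_performance(0) returns 0, a governor such as ondemand is probably still changing the clock during the run, and the timings should be treated with suspicion.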

Alex Brown
Thanks for the clue. So are you saying that I should check in the BIOS first if I can disable frequency scaling before tackling this problem at the OS level? Or do I also need to make the changes in the OS? Cheers
Kenny
Fixing the BIOS is easier, and if you don't fix it you won't make any headway with the OS, so do it first.
Alex Brown
A: 

Have you considered running the code inside Valgrind's cachegrind or callgrind tools? These should be able to provide you with accurate instruction counts by running the code through Valgrind's "VM".

Michael Anderson