views: 392
answers: 7

It's not uncommon for me to have a program whose performance relies heavily on just a few functions. I want to be able to measure a single loop or code segment's speed down to single-clock precision, so that I know whether my changes are actually improving performance or whether I'm just falling for the placebo effect of "optimized" code.

I personally find myself using ffmpeg's "bench.h", a set of C macros that use rdtsc to measure clock time and automatically compensate for context switches and the like. Of course, this approach has its own weaknesses; what other low-level profiling methods do StackOverflow users like?

+1  A: 

I don't do low-level programming now, but if I did, I would definitely look into DTrace; from what I've read it looks extremely interesting. For OS X users there's also Shark.

Aeon
+1  A: 

valgrind is my tool of choice on Unix-based systems.

JeffFoster
A: 

The main problem is that when you "compile in" your benchmarking, you potentially modify your results (depending on how and when the measurement is implemented). And at this low level you're probably heavily influenced by your compiler's optimizations.

But personally, on Linux I have a soft spot for OProfile. It's a system-wide profiler, embedded as a kernel module, that periodically samples where your application is spending time. So it profiles your entire system, not just one application, but it may not give you enough granularity.

amo-ej1
A: 

I would advise against instrumenting your code to profile it. The best answer I can give is to use PTU (Performance Tuning Utility) from Intel.

This utility is the direct descendant of VTune and provides the best sampling profiler available. You'll be able to track where the CPU is spending or wasting time (with the help of hardware performance events), with no slowdown of your application and no perturbation of the profile.

Fabien Hure
+2  A: 

valgrind has already been mentioned, but it's especially useful with the callgrind tool:

$ valgrind --tool=callgrind your_program

Then you can use KCacheGrind to visualize the data.

Torsten Marek
A: 

For Linux: Google Perftools

  • Faster than valgrind (though not as fine-grained)
  • Does not need code instrumentation
  • Nice graphical output (via kcachegrind)
  • Does memory profiling, CPU profiling, and leak checking
Weidenrinde
A: 

OK, you're describing a hot-spot situation: a tight loop that occupies a large fraction of time and does not contain function calls.

You want to know if changes you make are having any effect.

Here's what I would do:

  • To see what to change to make it faster, two bone-simple methods:

1) Single-step through the inner loop, to see exactly what it's doing and why. Chances are pretty good I will see some things that might be done better.

and / or

2) Get it running in a big outer loop, and then manually interrupt it. Do this several times. The instructions / statements that account for the most time will appear in those samples roughly in proportion to their cost.

  • To tell if I've made any difference, another bone-simple technique:

Run it a billion times in an outer loop and count the seconds. That tells you how many nanoseconds the inner loop takes.

Mike Dunlavey