It's not uncommon for me to have a program whose performance depends heavily on just a few functions. I want to be able to measure the speed of a single loop or code segment down to single-clock precision, so that I know whether my changes are actually improving performance or whether I'm just falling for the placebo of "optimized" code.
I personally find myself using ffmpeg's "bench.h", a set of C macros that use rdtsc to measure elapsed clock cycles and automatically compensate for context switches and similar interruptions. Of course, this approach has its own weaknesses; what other low-level profiling methods do StackOverflow users like?
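For concreteness, here is a minimal sketch of the kind of rdtsc-based timing I mean. It is not ffmpeg's actual code: the names `read_ticks`, `bench_stats`, `bench_record`, `bench_mean`, `bench_run` and `work` are all made up for illustration, and the "discard anything 8x above the running mean" rule is just one assumed way to filter out samples inflated by a context switch or interrupt. It assumes GCC or Clang; on non-x86 targets it falls back to `clock_gettime`, so the unit becomes nanoseconds instead of TSC ticks.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the time-stamp counter. Note: rdtsc is not a serializing
 * instruction, and on most modern CPUs the TSC ticks at a constant
 * rate rather than at the current core clock. */
static inline uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Assumed fallback for non-x86 targets: monotonic clock, in nanoseconds. */
static inline uint64_t read_ticks(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

/* Hypothetical accumulator for timing samples. */
typedef struct {
    uint64_t sum;     /* total ticks over accepted samples */
    uint64_t count;   /* number of accepted samples */
    uint64_t skipped; /* samples rejected as outliers */
} bench_stats;

/* Accept the first two samples unconditionally; after that, reject any
 * sample that is at least 8x the running mean (the factor 8 is an
 * arbitrary choice for this sketch). */
static void bench_record(bench_stats *s, uint64_t ticks) {
    if (s->count < 2 || ticks < 8 * s->sum / s->count) {
        s->sum += ticks;
        s->count++;
    } else {
        s->skipped++;
    }
}

/* Mean ticks per accepted sample, or 0 if nothing was recorded. */
static double bench_mean(const bench_stats *s) {
    return s->count ? (double)s->sum / (double)s->count : 0.0;
}

/* Hypothetical workload standing in for the code segment under test.
 * The volatile accumulator keeps the compiler from deleting the loop. */
static uint32_t work(uint32_t n) {
    volatile uint32_t acc = 0;
    for (uint32_t i = 0; i < n; i++)
        acc += i * i;
    return acc;
}

/* Time `iterations` runs of work(n) and return the filtered statistics. */
static bench_stats bench_run(int iterations, uint32_t n) {
    bench_stats s = {0, 0, 0};
    for (int i = 0; i < iterations; i++) {
        uint64_t t0 = read_ticks();
        (void)work(n);
        uint64_t t1 = read_ticks();
        bench_record(&s, t1 - t0);
    }
    return s;
}
```

Calling `bench_run(1000, 10000)` and passing the result to `bench_mean` gives an average tick count per run with the worst outliers dropped. The obvious weaknesses are the ones above: the filter is a heuristic, the measurement includes the overhead of reading the counter, and without serialization the CPU may reorder work across the two reads.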