I am using the following code to profile my operations, to optimize for the CPU cycles taken in my functions:

#include <stdint.h>

static __inline__ uint64_t GetCC(void)
{
  unsigned a, d;
  /* RDTSC leaves the 64-bit time-stamp counter in EDX:EAX;
     a fixed 64-bit type keeps the shift safe on 32-bit platforms too */
  asm volatile("rdtsc" : "=a" (a), "=d" (d));
  return ((uint64_t)a) | (((uint64_t)d) << 32);
}

I don't think it is the best approach, since even two consecutive calls give me a difference of "33". Any suggestions?

+1  A: 

Trying to count the cycles of an individual execution of a function is not really the right way to go. The fact that your process can be interrupted at any time, along with delays caused by cache misses and branch mispredictions, means that there can be considerable variation in the number of cycles taken from call to call.

The right way is either:

  • Count the number of cycles or CPU time (with clock()) taken for a large number of calls to the function, then average them (see the sketch after this list); or
  • Use a cycle-level emulating profiler like Callgrind / kcachegrind.
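
For instance, a minimal sketch of the first approach; my_function here is a hypothetical stand-in for the code under test:

#include <stdio.h>
#include <time.h>

static void my_function(void) { /* ... code under test ... */ }

int main(void)
{
  enum { N = 1000000 };
  clock_t start = clock();
  for (int i = 0; i < N; i++)
    my_function();
  clock_t end = clock();
  /* average CPU time per call, in seconds */
  printf("%g s/call\n", (double)(end - start) / CLOCKS_PER_SEC / N);
  return 0;
}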

By the way, you need to execute a serialising instruction before RDTSC. Typically CPUID is used.
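
A sketch of a serialised read with GCC inline assembly, assuming x86-64 (cpuid with leaf 0 is used purely as a barrier; it clobbers ebx and ecx on top of the eax/edx outputs):

static __inline__ unsigned long long GetCCSerialised(void)
{
  unsigned a, d;
  /* cpuid forces all earlier instructions to retire before rdtsc samples the counter */
  asm volatile("cpuid\n\t"
               "rdtsc"
               : "=a" (a), "=d" (d)
               : "a" (0)
               : "ebx", "ecx");
  return ((unsigned long long)a) | (((unsigned long long)d) << 32);
}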

caf
Not to mention that the serialization before `RDTSC` will have a negative effect on your measurements.
Nathan Fellman
Yup, I knew about CPUID, but I was not aware of what it does. About individual execution, yes, I agree. I was testing across 1000 runs and removing the time taken to do the 1000 runs, and my guess is that the number 33 was coming from `RDTSC` itself. `clock()` did not really work for me. But I'll look up kcachegrind when the full software is done.
Humble Debugger
+1  A: 

The TSC isn't a good measure of time. The only guarantee the CPU makes about the TSC is that it rises monotonically (that is, if you RDTSC once and then do it again, the second one will return a result that is higher than the first) and that it will take a very long time to wrap around.

Nathan Fellman
+1  A: 

You are on the right track[1], but you need to do two things:

  1. Run the cpuid instruction before rdtsc to flush the CPU pipeline, which makes the measurement more reliable. As far as I recall, it clobbers the registers eax through edx.
  2. Measure real time. There is a lot more to execution time than just CPU cycles (locking contention, context switches and other overhead you don't control). Calibrate TSC ticks against real time. You can do it in a simple loop that takes the difference between measurements of, say, gettimeofday (Linux, since you didn't mention the platform) calls and rdtsc output; then you can tell how much time each TSC tick takes (see the calibration sketch below). Another consideration is synchronization of the TSC across CPUs, because each core may have its own counter. In Linux you can check this in /proc/cpuinfo: your CPU should have a constant_tsc flag. Most newer Intel CPUs I've seen have this flag.

[1] I have personally found rdtsc to be more accurate than system calls like gettimeofday() for fine-grained measurements.
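
A rough sketch of the calibration loop from point 2, assuming Linux on x86-64; the 100 ms window and the name TicksPerMicrosecond are arbitrary choices for this sketch:

#include <stdint.h>
#include <sys/time.h>

/* the rdtsc wrapper from the question */
static __inline__ uint64_t GetCC(void)
{
  unsigned a, d;
  asm volatile("rdtsc" : "=a" (a), "=d" (d));
  return ((uint64_t)a) | (((uint64_t)d) << 32);
}

double TicksPerMicrosecond(void)
{
  struct timeval t0, t1;
  uint64_t c0, c1;
  long usec;

  gettimeofday(&t0, NULL);
  c0 = GetCC();
  do {  /* spin for roughly 100 ms of wall-clock time */
    gettimeofday(&t1, NULL);
    usec = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
  } while (usec < 100000);
  c1 = GetCC();

  return (double)(c1 - c0) / (double)usec;  /* TSC ticks per microsecond */
}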

Alex B
Thanks. I need to write a function that takes at most 1 microsecond, hence the need to use `rdtsc`. Besides the "33" between 2 calls, I am pretty happy with `rdtsc` so far. I checked, and the CPU has the `constant_tsc` flag.
Humble Debugger
+2  A: 

Another thing you might need to worry about is that, if you are running on a multi-core machine, the program could be moved to a different core, which may have a different rdtsc counter. You may be able to pin the process to one core via a system call, though.
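
On Linux, a sketch of that pinning using sched_setaffinity(); PinToCore is a name invented for this sketch:

#define _GNU_SOURCE
#include <sched.h>

/* pin the calling process to a single core; returns 0 on success */
int PinToCore(int core)
{
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  return sched_setaffinity(0, sizeof(set), &set);  /* pid 0 = calling process */
}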

If I were trying to measure something like this, I would probably record the time stamps to an array, and then come back and examine this array after the code being benchmarked had completed. When you are examining the data recorded to the array of timestamps, you should keep in mind that this array will rely on the CPU cache (and possibly paging, if your array is big), but you could prefetch it, or just keep that in mind as you analyze the data. You should see a very regular time delta between time stamps, but with several spikes and possibly a few dips (probably from getting moved to a different core). The regular time delta is probably your best measurement, since it suggests that no outside events affected those measurements.
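
A minimal sketch of that record-then-analyse idea, assuming the GetCC() wrapper from the question is defined earlier in the same file (SAMPLES and RecordStamps are names invented for illustration):

#include <stdint.h>

enum { SAMPLES = 4096 };
static uint64_t stamps[SAMPLES];

void RecordStamps(void)
{
  for (int i = 0; i < SAMPLES; i++) {
    stamps[i] = GetCC();  /* the rdtsc wrapper from the question */
    /* ... one iteration of the code being benchmarked ... */
  }
  /* Examine stamps[i+1] - stamps[i] afterwards: the most frequent delta is
     your best estimate, while spikes suggest interrupts or a core migration. */
}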

That being said, if the code you are benchmarking has irregular memory access patterns or run times, or relies on system calls (especially IO-related ones), then you will have a difficult time separating the noise from the data you are interested in.

nategoose
I believe the TSC is synchronized between cores, so it's not a concern
Nathan Fellman
@Nathan Fellman: According to http://en.wikipedia.org/wiki/Time_Stamp_Counter, it is not on some older AMDs
nategoose
ahah... good point. I was thinking of Intel CPUs
Nathan Fellman
A: 

Do I understand correctly that the reason you do this is to bracket other code with it so you can measure how long the other code takes?

I'm sure you know another good way to do that is to just loop the other code 10^6 times, stopwatch it, and call it microseconds.

Once you've measured the other code, am I correct to assume you want to know which lines in it are worth optimizing, so as to reduce the time it takes?

If so, you're on well-trod ground. You could use a tool like Zoom or LTProf. Here's my favorite method.

Mike Dunlavey