views:

242

answers:

3

Dear all,

Is there a way using C or assembler or maybe even C# to get an accurate measure of how long it takes to execute a ADD instruction?

A: 

No, but you can calculate it based upon the number of clock cycles the add instruction requires multiplied by the clock rate of the CPU. Different types of arguments to an ADD may result in more or fewer cycles but, for a given argument list, the instruction always takes the same number of cycles to complete.

That said, why do you care?

sizzzzlerz
actually an ADD in computer land is always 1 register plus another. It would never be 7 registers at once, so thats a falsehood.
Woot4Moo
I care for a study I'm doing.
Tony
@WootMoo:That depends on the processor and instruction set involved. Just for example, the VAX has an "AddP6" instruction ("Add packed, 6 operand"). The x86 doesn't fit your description either.
Jerry Coffin
@Woot4Moo: You may want to consult the processor manuals you use before you talk about what an `ADD` "always is". The Intel 64-bit (non-Itanium) chips have 13 different `ADD` forms that can have memory as the destination of the operation and 5 forms that have memory as the source. Indeed the only combinations of operands not supported are immediate/immediate and memory/memory.
JUST MY correct OPINION
I learn something new every day, for instance that a single thread can do multiple additions at once.
Woot4Moo
+2  A: 

Construct a loop that executes 10 million times, with nothing in the loop body, and time that. Keep that time as the overhead required for looping.

Then execute the same loop again, this time with the code under test in the body. Time for this loop, minus the overhead (from the empty loop case) is the time due to the 10 million repetitions of your code under test. So, divide by the number of iterations.

Obviously this method needs tuning with regard to the number of iterations. If what you're measuring is small, like a single instruction, you might even want to run upwards of a billion iterations. If its a significant chunk of code, a few 10's of thousands might suffice.

In the case of a single assembly instruction, the assembler is probably the right tool for the job, or perhaps C, if you are conversant with inline assembly. Others have posted more elegant solutions for how to get a measurement w/o the repetition, but the repetition technique is always available, for example, an embedded processor that doesn't have the nice timing instructions mentioned by others.

Note however, that on modern pipeline processors, instruction level parallelism may confound your results. Because more than one instruction is running through the execution pipeline at a time, it is no longer true that N repetitions of an given instruction take N times as long as a single one.

JustJeff
+9  A: 

Yes, sort of, but it's non-trivial and produces results that are almost meaningless, at least on most reasonably modern processors.

On relatively slow processors (e.g., up through the original Pentium in the Intel line, still true on most small embedded processors) you can just look in the processor's data sheet and it'll (normally) tell you how many clock ticks to expect. Quick, simple, and easy.

On a modern desktop machine (e.g., Pentium Pro or newer), life isn't nearly that simple. These CPUs can execute a number of instructions at a time, and execute them out of order as long as there aren't any dependencies between them. This means the whole concept of the time taken by a single instruction becomes almost meaningless. The time taken to execute one instruction can and will depend on the instructions that surround it.

That said, yes, if you really want to, you can (usually -- depending on the processor) measure something, though it's open to considerable question exactly how much it'll really mean. Even getting a result like this that's only close to meaningless instead of completely meaningless isn't trivial though. For example, on an Intel or AMD chip, you can use RDTSC to do the timing measurement itself. That, unfortunately, can be executed out of order as described above. To get meaningful results, you need to surround it by an instruction that can't be executed out of order (a "serializing instruction"). The most common choice for that is CPUID, since it's one of the few serializing instructions that's available to "user mode" (i.e., ring 3) programs. That adds a bit of a twist itself though: as documented by Intel, the first few times the processor executes CPUID, it can take longer than subsequent times. As such, they recommend that you execute it three times before you use it to serialize your timing. Therefore, the general sequence runs something like this:

.align 16
CPUID
CPUID
CPUID
RDTSC
; sequence under test
Add eax, ebx
; end of sequence under test
CPUID
RDTSC

Then you compare that to a result from doing the same, but with the sequence under test removed. That's leaving out quite a fe details, of course -- at minimum you need to:

  1. set the registers up correctly before each CPUID
  2. save the value in EAX:EDX after the first RDTSC
  3. subtract result from the second RDTSC from the first

Also note the "align" directive I've inserted -- instruction alignment can and will affect timing as well, especially if a loop is involved.

Jerry Coffin
i tried to say that part, about the value of measurement, but you said it better. you get my +1!
JustJeff