views:

3199

answers:

6

Hello everyone,

I heard there is an Intel book online which describes the CPU cycles needed for a specific assembly instruction, but I cannot find it (after trying hard). Could anyone show me how to find the CPU cycle counts, please?

Here is an example: in the code below, mov/lock is 1 CPU cycle, and xchg is 3 CPU cycles.

// This part is Platform dependent!
#ifdef WIN32
inline int CPP_SpinLock::TestAndSet(int* pTargetAddress,
                                    int nValue)
{
    __asm
    {
        mov edx, dword ptr [pTargetAddress]
        mov eax, nValue
        lock xchg eax, dword ptr [edx]
    }
    // mov = 1 CPU cycle
    // lock = 1 CPU cycle
    // xchg = 3 CPU cycles
}

#endif // WIN32

BTW: here is the URL for the code I posted.

http://www.codeproject.com/KB/threads/spinlocks.aspx

thanks in advance,

+1  A: 

Is this of use? I don't believe it covers the latest instructions, but it should help with examples like the one above.

Brian Agnew
@Brian, are there any Intel or AMD documents which cover how many CPU cycles are needed for a specific assembly instruction, like mov/xchg?
George2
BTW: the document you recommended seems too old. :-)
George2
+7  A: 

See:

Can Berk Güder
+1. On the Intel PDF see "Appendix C - Instruction Latency and Throughput"
jcinacio
@Can and @jcinacio, the documents you recommended are excellent, but I am too stupid. :-( Could you guide me through how to find how many CPU cycles are needed for instructions like mov/xchg? I looked in the Appendix C tables but find it confusing to work out what each column and number mean.
George2
(continued) so if I could learn from you via 1-2 samples, I think I'd be confident looking up the CPU cycles needed for any instruction in the future! :-)
George2
@George2: See tables C-13 and C-13a for the general-purpose instructions, and see section C-2 for the definitions of latency and throughput in the Intel PDF.
Can Berk Güder
+13  A: 

Given pipelining, out-of-order execution, microcode, multi-core processors, etc., there's no guarantee that a particular section of assembly code will take exactly x CPU cycles/clock cycles/whatever cycles.

If such a reference exists, it can only provide broad generalizations for a particular architecture, and depending on how the microcode is implemented you may find that the Pentium M differs from the Core 2 Duo, which differs from the AMD dual core, etc.

Note that this article was updated in 2000, and written earlier. Even the Pentium 4 is hard to pin down regarding instruction timing - the PIII, PII, and the original Pentium were easier, and the texts referenced were probably based on those earlier processors, which had more well-defined instruction timing.

These days people generally use statistical analysis for code timing estimation.

Adam Davis
Excellent answer! Covers every counter question one might have.
CDR
Technically not entirely accurate. Each instruction does have a fixed duration/latency, as specified in Can Berk Güder's answer. For the reasons you point out, though, this alone is only part of the story. Knowing the latency of each instruction doesn't tell you when it gets scheduled.
jalf
@jalf - True, but as you agree the ultimate effect of these things means that there's no easy way to calculate the real timing of the instruction in advance of running them.
Adam Davis
@Adam, 1. I only need to know the CPU cycles for a specific assembly instruction, not a "section of assembly code". 2. After reading your reply, I think the same assembly instruction, like mov, will have different CPU cycles on different architectures, correct? (continued)
George2
(continued) 3. Is there any reference material, for example, for a specific model of Intel processor, showing how many CPU cycles are needed for a specific assembly instruction?
George2
@jalf, could you guide me through how to find how many CPU cycles are needed for instructions like mov/xchg? I looked in the documents recommended by Can, but find it confusing to work out what exactly each column in the tables means. Thanks.
George2
@George2 - 1. Unfortunately, due to pipelining, out-of-order execution, and caching, a given assembly sequence will not always take the same amount of time. Even the documents provided by Can Berk Güder are statistical analysis.
Adam Davis
2. Yes, the same instruction will have different cycles on different architectures. Can Berk Güder's documents show this quite clearly - the Pentium M executes some instructions in fewer cycles than the newer Core 2, but it runs more slowly and has a shorter pipeline, so the throughput is lower.
Adam Davis
3. Can Berk Güder has provided good guides for this. If, after reading all this, you still don't understand how to use them, or why there's no simple way to determine the CPU cycle count, then there's not much more help we can provide. Your best bet is to run some tests.
Adam Davis
+5  A: 

Measuring and counting CPU-cycles does not make sense on the x86 anymore.

First off, ask yourself: for which CPU are you counting cycles? A Core 2? An Athlon? A Pentium M? An Atom? All these CPUs execute x86 code, but all of them have different execution times. Execution even varies between different steppings of the same CPU.

The last x86 where cycle counting made sense was the Pentium Pro.

Also consider that inside the CPU, most instructions are translated into microcode and executed out of order by an internal execution unit that does not even remotely look like an x86. The performance of a single CPU instruction depends on how many resources in the internal execution unit are available.

So the time for an instruction depends not only on the instruction itself but also on the surrounding code.

Anyway: You can estimate the resource usage, latency and thus the speed of instructions for different processors. The relevant information can be found at the Intel and AMD sites.

Agner Fog has a very nice summary on his web-site:

http://www.agner.org/


Btw - since your example code is a basic building block for a lock-free data structure: have you considered using the compiler built-in functions? On Win32 you can include intrin.h and use functions such as _InterlockedExchange.

That'll give you better execution time because the compiler can inline the instructions. Inline assembler always forces the compiler to disable optimizations around the asm code.

Nils Pipenbrinck
@Nils, I think you mean that the overall elapsed time for an instruction varies depending on system resource status and scheduling. But once the instruction is executing, it will execute in a fixed number of CPU cycles for a specific architecture, correct?
George2
@Nils, the code sample is just for my learning purposes, to learn about spin locks; for real programming practice, I will definitely use the interlocked functions.
George2
BTW: on http://www.agner.org/ where is the information that shows the CPU cycles needed for an assembly instruction? I looked for some time on this site but found nothing. Could you give 1-2 links please? :-)
George2
+3  A: 

What the other answers say about it being impossible to accurately predict the performance of code running on a modern CPU is true, but that doesn't mean the latencies are unknown, or that knowing them is useless.

The exact latencies for Intel's and AMD's processors are listed in the manuals shown in Can Berk Güder's answer. (AMD also has PDF manuals on their own website with the official values.)

For (micro-)optimizing tight loops, knowing the latencies for each instruction can help a lot in manually trying to schedule your code. The programmer can make a lot of optimizations that the compiler can't (because the compiler can't guarantee it won't change the meaning of the program).

Of course, this still requires you to know a lot of other details about the CPU, such as how deeply pipelined it is, how many instructions it can issue per cycle, the number of execution units, and so on. And of course, these numbers vary between CPUs. But you can often come up with a reasonable average that more or less works for all CPUs.

It's worth noting though, that it is a lot of work to optimize even a few lines of code at this level. And it is easy to make something that turns out to be a pessimization. Modern CPUs are hugely complicated, and they try extremely hard to get good performance out of bad code. But there are also cases they're unable to handle efficiently, or where you think you're clever and making efficient code, and it turns out to slow the CPU down.

Edit Looking in Intel's optimization manual, table C-13: The first column is instruction type, then there is a number of columns for latency for each CPUID. The CPUID indicates which processor family the numbers apply to, and are explained elsewhere in the document. The latency specifies how many cycles it takes before the result of the instruction is available, so this is the number you're looking for.

The throughput columns show how many of this type of instructions can be executed per cycle.

Looking up xchg in this table, we see that depending on the CPU family, it takes 1-3 cycles, and a mov takes 0.5-1. (I haven't looked up what each CPUID means, but I assume the 0.5 values are for the Pentium 4, which ran some components of the chip at double speed, allowing it to do things in half cycles.)

I don't really see what you plan to use this information for, but if you know the exact CPU family the code is running on, then adding up the latencies tells you the minimum number of cycles required to execute this sequence of instructions.

jalf
@jalf, could you guide me to explain how to find how much CPU cycles needed for instruction like mov/xchg? I looked in documents recommended mentioned by others from Intel, but feel confusing to find what exactly each columns mean in tables. Thanks.
George2
The latency columns show how many cycles it takes from when an instruction is initiated until its result is available. Intel subdivides this into different CPUIDs to show the values for various CPU families. xchg is listed as 1-3 cycles depending on the CPU, and mov as 0.5-1.
jalf
Edited my post to add these details
jalf
A: 

lock xchg eax, dword ptr [edx]

Note that the lock prefix locks the memory access across all cores; this can take on the order of 100 cycles on some multi-core systems, and a cache line may also need to be flushed. It will also stall the pipeline. So I wouldn't worry about the rest.

So optimal performance comes back to tuning your algorithm's critical regions.

Note that on a single core you could optimize this by removing the lock, but it is needed for multi-core.

ben