Hello

What is the cache access speed of modern CPUs? How many bytes can be read from or written to memory per processor clock tick on Intel P4, Core 2, Core i7, and AMD CPUs?

Please answer with both theoretical numbers (width of the load/store unit and its throughput in uOPs/tick) and practical numbers (even memcpy speed tests or STREAM benchmark results), if any.

PS: This question is about the maximum rate of load/store instructions in assembler. There is a theoretical load rate (every instruction issued per tick is the widest possible load), but the processor may sustain only part of it, which is the practical load limit.

+2  A: 

The widest reads/writes are 128-bit (16-byte) SSE loads/stores. L1/L2/L3 caches have different bandwidths and latencies, and these are of course CPU-specific. Typical L1 latency is 2-4 clocks on modern CPUs, but you can usually issue 1 or 2 load instructions per clock.
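As a rough worked example of the arithmetic above, theoretical peak load bandwidth is load width times loads issued per clock times clock frequency. The sketch below is only illustrative: the function name `peak_load_bandwidth` is made up here, and the 3 GHz clock mentioned afterwards is an assumed value, not a figure from this thread.

```c
#include <assert.h>

/* Theoretical peak load bandwidth in bytes per second.
 * bytes_per_load:  width of one load, e.g. 16 for a 128-bit SSE load
 * loads_per_clock: loads the core can issue each cycle (CPU-specific, assumed)
 * clock_hz:        core clock frequency in Hz (assumed)
 */
static double peak_load_bandwidth(double bytes_per_load,
                                  double loads_per_clock,
                                  double clock_hz)
{
    assert(bytes_per_load > 0 && loads_per_clock > 0 && clock_hz > 0);
    return bytes_per_load * loads_per_clock * clock_hz;
}
```

For instance, `peak_load_bandwidth(16.0, 1.0, 3.0e9)` gives 48e9 bytes/s for one 16-byte load per clock at an assumed 3 GHz. This is an upper bound only; sustained rates are normally lower.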

I suspect there's a more specific question lurking here somewhere. What is it that you are actually trying to achieve? Do you just want to write the fastest possible memcpy?

Paul R
Thanks. How many SSE loads can be issued per clock? I want to find the peak load/store bandwidth for several generations of x86. Not only memcpy, but also a plain read and a plain write (closer to the STREAM benchmark).
osgx
@osgx - it depends on the CPU - Core 2 and Core i7 can both *issue* 2 SSE loads per clock
Paul R
About the fastest memcpy: yes, the question can be rephrased as "What is the theoretically fastest memcpy?" (without an actual implementation), and not only for very big data (as usual) but for small data too (up to L1/2 size, up to L2/2 size, L3/3 size).
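For the practical side, a minimal sketch of timing memcpy on a buffer of a chosen size (for example half of L1 or half of L2) might look like the following. The function name `copy_bandwidth` and its return conventions are hypothetical, and `clock()` is coarse, so the repetition count has to be large enough to get a measurable interval. This is not a replacement for STREAM, only a way to see how the copy rate changes with buffer size.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Measured copy rate in bytes per second for a buffer of `bytes` bytes,
 * copied `reps` times. Returns 0.0 if the timer saw no elapsed time,
 * or -1.0 if allocation failed. */
static double copy_bandwidth(size_t bytes, int reps)
{
    assert(bytes > 0 && reps > 0);

    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers so pages are mapped before timing starts. */
    memset(src, 1, bytes);
    memset(dst, 0, bytes);

    volatile unsigned char sink = 0;
    clock_t start = clock();
    for (int i = 0; i < reps; i++) {
        src[0] = (unsigned char)i;   /* vary the input between copies */
        memcpy(dst, src, bytes);
        sink ^= dst[bytes - 1];      /* read the result so the copy is used */
    }
    clock_t end = clock();

    free(src);
    free(dst);

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;
    return (double)bytes * (double)reps / seconds;
}
```

Calling it with sizes such as 16 KB, 128 KB, and several MB (sizes chosen here only as examples) should show the rate dropping as the working set falls out of each cache level. Dividing the result by the core clock frequency gives bytes per clock for comparison with the theoretical figures.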
osgx
+1  A: 

For Nehalem: rolfed.com/nehalem/nehalemPaper.pdf

Each core in the architecture has a 128-bit write port and a
128-bit read port to the L1 cache. 

128 bits = 16 bytes/clock read AND 128 bits = 16 bytes/clock write (can a read and a write be combined in a single cycle?)

The L2 and L3 caches each have a 256-bit port for reading or writing, 
but the L3 cache must share its port with three other cores on the chip.

Can the L2 and L3 read and write ports be used in a single clock?

Each integrated memory controller has a theoretical bandwidth
peak of 32 Gbps.

Latency (clock ticks), http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3326&p=5

           L1     L2     L3        mem
core 2      3     15     --
core i7     4     11     39
itanium     1     5-6    12-17   130-1000
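Latencies like those in the table are commonly estimated with a pointer chase, where each load depends on the result of the previous one. The sketch below is an assumption-laden illustration, not the method used for the figures above: `chase_latency_ns` is a hypothetical name, it reports nanoseconds per load rather than clock ticks (multiply by the clock frequency in GHz to convert), and `rand()` may give a poor shuffle for very large buffers.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Average time per dependent load, in nanoseconds, while chasing indices
 * through a buffer of n slots for `steps` loads. Returns 0.0 if the timer
 * saw no elapsed time, or -1.0 if allocation failed. */
static double chase_latency_ns(size_t n, size_t steps)
{
    assert(n > 1 && steps > 0);

    size_t *next = malloc(n * sizeof *next);
    if (!next)
        return -1.0;

    /* Sattolo's algorithm: a random permutation forming one single cycle,
     * so the chase visits every slot before repeating. */
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j in [0, i-1] */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }

    size_t pos = 0;
    clock_t start = clock();
    for (size_t s = 0; s < steps; s++)
        pos = next[pos];                 /* each load waits for the previous one */
    clock_t end = clock();

    volatile size_t sink = pos;          /* keep the chase result observable */
    (void)sink;
    free(next);

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;
    return seconds * 1e9 / (double)steps;
}
```

Choosing `n` so that the buffer fits in L1, then L2, then L3, then exceeds all caches should produce a stepwise increase in the measured time per load.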
osgx
Answering your own question? You still haven't explained what it is that you are trying to achieve with this information. You may get a better answer if you do.
Paul R
I am studying CPU architectures and want to compare them.
osgx