We have Core2 machines (Dell T5400) with XP64.

We observe that when running 32-bit processes, the performance of memcpy is on the order of 1.2GByte/s; however memcpy in a 64-bit process achieves about 2.2GByte/s (or 2.4GByte/s with the Intel compiler CRT's memcpy). While the initial reaction might be to just explain this away as due to the wider registers available in 64-bit code, we observe that our own memcpy-like SSE assembly code (which should be using 128-bit wide load-stores regardless of 32/64-bitness of the process) demonstrates similar upper limits on the copy bandwidth it achieves.

My question is, what's this difference actually due to? Do 32-bit processes have to jump through some extra WOW64 hoops to get at the RAM? Is it something to do with TLBs or prefetchers or... what?

Thanks for any insight.

Also raised on Intel forums.

+1  A: 

My off-the-cuff guess is that the 64 bit processes are using the processor's native 64-bit memory size, which optimizes the use of the memory bus.

Harper Shelby
+8  A: 

I think the following can explain it:

To copy data from memory to a register and back to memory, you do

mov eax, [address]
mov [address2], eax

This moves 32 bits (4 bytes) from address to address2. The same goes for 64 bits in 64-bit mode:

mov rax, [address]
mov [address2], rax

This moves 64 bits (8 bytes) from address to address2. "mov" itself, regardless of whether it is 64 bit or 32 bit, has a latency of 0.5 and a throughput of 0.5 according to Intel's specs. Latency is how many clock cycles the instruction takes to travel through the pipeline, and throughput is how long the CPU has to wait before it can accept the same instruction again. A throughput of 0.5 means the CPU can start a new mov every half clock cycle, i.e. two movs per clock cycle; since each copy needs both a load and a store, that works out to one load/store pair per clock cycle (or am I wrong here and misinterpreting the terms? See the PDF linked here for details).

Of course a mov reg, mem can take longer than 0.5 cycles, depending on whether the data is in the 1st or 2nd level cache, or not in cache at all and needs to be grabbed from memory. However, the latency figure above ignores this fact (as the PDF I linked above states); it assumes all the data necessary for the mov is already present (otherwise the latency increases by however long it takes to fetch the data from wherever it is right now; this may be several clock cycles and is completely independent of the instruction being executed, says the PDF on page 482/C-30).

What is interesting is that whether the mov is 32 or 64 bit plays no role. That means that unless memory bandwidth becomes the limiting factor, 64-bit movs are just as fast as 32-bit movs, and since it takes only half as many movs to move the same amount of data from A to B when using 64 bit, the throughput can (in theory) be twice as high (the fact that it isn't is probably because memory isn't unlimitedly fast).
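
To make that concrete, here's a rough C++ sketch (my own illustration, not from any real library): the 64-bit version issues half as many loads and stores for the same number of bytes, assuming the compiler lowers each assignment to a plain mov pair.

  #include <cstdint>
  #include <cstddef>

  // Copy n 32-bit words: one 32-bit load + one 32-bit store per element.
  void copy32(std::uint32_t* dst, const std::uint32_t* src, std::size_t n)
  {
      for (std::size_t i = 0; i < n; ++i)
          dst[i] = src[i];
  }

  // Same number of bytes as copy32(dst, src, 2*n), but half as many movs,
  // because each load/store now moves 8 bytes.
  void copy64(std::uint64_t* dst, const std::uint64_t* src, std::size_t n)
  {
      for (std::size_t i = 0; i < n; ++i)
          dst[i] = src[i];
  }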

Okay, now you might think that using the larger SSE registers should give faster throughput, right? AFAIK the xmm registers are not 256 but 128 bits wide, BTW (reference at Wikipedia). However, have you considered latency and throughput? Either the data you want to move is 128-bit aligned or it isn't. Depending on that, you either move it using

movdqa xmm1, [address]
movdqa [address2], xmm1

or if not aligned

movdqu xmm1, [address]
movdqu [address2], xmm1

Well, movdqa/movdqu have a latency of 1 and a throughput of 1. So the instructions take twice as long to execute, and the waiting time after each instruction is twice as long as for a normal mov.
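
For reference, in C++ this is the difference between the aligned and unaligned SSE2 load/store intrinsics; a minimal sketch of mine, assuming the byte count is a multiple of 16:

  #include <emmintrin.h>   // SSE2: _mm_load_si128 / _mm_loadu_si128 etc.
  #include <cstdint>
  #include <cstddef>

  // Copy 16 bytes per iteration, picking the movdqa (aligned) or movdqu
  // (unaligned) forms depending on the pointers. Assumes 'bytes' is a
  // multiple of 16; a real routine would handle the remainder.
  void sse_copy(void* dst, const void* src, std::size_t bytes)
  {
      __m128i*       d = static_cast<__m128i*>(dst);
      const __m128i* s = static_cast<const __m128i*>(src);
      const bool aligned =
          ((reinterpret_cast<std::uintptr_t>(dst) |
            reinterpret_cast<std::uintptr_t>(src)) & 15u) == 0;
      for (std::size_t i = 0; i < bytes / 16; ++i) {
          if (aligned)
              _mm_store_si128(d + i, _mm_load_si128(s + i));    // movdqa
          else
              _mm_storeu_si128(d + i, _mm_loadu_si128(s + i));  // movdqu
      }
  }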

And something else we have not even taken into account is the fact that the CPU actually splits instructions into micro-ops and it can execute these in parallel. Now it starts getting really complicated... even too complicated for me.

Anyway, I know from experience that loading data to/from xmm registers is much slower than loading data to/from normal registers, so your idea to speed up the transfer by using xmm registers was doomed from the very first second. I'm actually surprised that in the end the SSE memmove is not much slower than the normal one.

Mecki
Very well written, I understood it and I don't know much about how processors actually operate.
cfeduke
Well this is all very fine (thanks for the SSE width correction), but it doesn't actually answer the basic question: why does code which should simply saturate memory bandwidth perform so much better as native 64-bit than as 32-bit under WOW64? Where's the bottleneck?
timday
A: 

I don't have a reference in front of me, so I'm not absolutely positive on the timings/instructions, but I can still give the theory. If you're doing a memory move in 32-bit mode, you'll do something like a "rep movsd", which moves a single 32-bit value every clock cycle. In 64-bit mode, you can do a "rep movsq", which does a single 64-bit move every clock cycle. That instruction is not available to 32-bit code, so you'd be doing twice as many movsd iterations (at 1 cycle apiece) for half the copy speed.

VERY much simplified, ignoring all the memory bandwidth/alignment issues, etc, but this is where it all begins...
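
If you want to experiment with this from C++ rather than hand-written assembly, MSVC exposes the string moves as intrinsics; here's a rough sketch assuming MSVC's __movsd/__movsq from <intrin.h> (__movsq is only available in 64-bit builds) and a byte count that divides evenly:

  #include <intrin.h>   // MSVC: __movsd / __movsq emit rep movsd / rep movsq
  #include <cstddef>

  // 64-bit builds move 8 bytes per string-move iteration, 32-bit builds only 4.
  void rep_copy(void* dst, const void* src, std::size_t bytes)
  {
  #if defined(_M_X64)
      __movsq(static_cast<unsigned __int64*>(dst),
              static_cast<const unsigned __int64*>(src), bytes / 8);
  #else
      __movsd(static_cast<unsigned long*>(dst),
              static_cast<const unsigned long*>(src), bytes / 4);
  #endif
  }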

Brian Knoblauch
But that doesn't explain why code copying via SSE registers (which are 128bit whether you're in 32-bit or 64-bit mode) seems to be bandwidth limited in 32bit.
timday
SSE registers should be doing stores at the width of the data bus (64 bits). However, since I don't have the timings in front of me, it may be that SSE stores take twice the clock cycles of a normal register store and hence end up with the same data rate as a 32-bit copy.
Brian Knoblauch
+2  A: 

Of course, you really need to look at the actual machine instructions that are being executed inside the innermost loop of the memcpy, by stepping into the machine code with a debugger. Anything else is just speculation.

My guess is that it probably doesn't have anything to do with 32-bit versus 64-bit per se; rather, the faster library routine was written using SSE non-temporal stores.

If the inner loop contains any variation of conventional load-store instructions, then the destination memory must be read into the machine's cache, modified, and written back out. Since that read is totally unnecessary -- the bits being read are overwritten immediately -- you can save half the memory bandwidth by using the "non-temporal" write instructions, which bypass the caches. That way, the destination memory is just written making a one-way trip to the memory instead of a round trip.

I don't know the Intel compiler's CRT library, so this is just a guess. There's no particular reason why the 32-bit libCRT can't do the same thing, but the speedup you quote is in the ballpark of what I would expect just by converting the movdqa instructions to movnt...

Since memcpy is not doing any calculations, it's always bound by how fast you can read and write memory.

Die in Sente
Yup, turns out you were right about the non-temporal stores. See my answer for the gnarly asm-level details. The fundamental problem seems to be that the Intel compiler/CRT doesn't always use its non-temporal version of memcpy in 32-bit.
timday
+3  A: 

I finally got to the bottom of this (and Die in Sente's answer was on the right lines, thanks).

In the below, dst and src are 512 MByte std::vectors. I'm using the Intel 10.1.029 compiler and CRT.

On 64bit both

memcpy(&dst[0],&src[0],dst.size())

and

memcpy(&dst[0],&src[0],N)

where N is previously declared const size_t N=512*(1<<20); call

__intel_fast_memcpy

the bulk of which consists of:

  000000014004ED80  lea         rcx,[rcx+40h] 
  000000014004ED84  lea         rdx,[rdx+40h] 
  000000014004ED88  lea         r8,[r8-40h] 
  000000014004ED8C  prefetchnta [rdx+180h] 
  000000014004ED93  movdqu      xmm0,xmmword ptr [rdx-40h] 
  000000014004ED98  movdqu      xmm1,xmmword ptr [rdx-30h] 
  000000014004ED9D  cmp         r8,40h 
  000000014004EDA1  movntdq     xmmword ptr [rcx-40h],xmm0 
  000000014004EDA6  movntdq     xmmword ptr [rcx-30h],xmm1 
  000000014004EDAB  movdqu      xmm2,xmmword ptr [rdx-20h] 
  000000014004EDB0  movdqu      xmm3,xmmword ptr [rdx-10h] 
  000000014004EDB5  movntdq     xmmword ptr [rcx-20h],xmm2 
  000000014004EDBA  movntdq     xmmword ptr [rcx-10h],xmm3 
  000000014004EDBF  jge         000000014004ED80

and runs at ~2200 MByte/s.

But on 32bit

memcpy(&dst[0],&src[0],dst.size())

calls

__intel_fast_memcpy

the bulk of which consists of

  004447A0  sub         ecx,80h 
  004447A6  movdqa      xmm0,xmmword ptr [esi] 
  004447AA  movdqa      xmm1,xmmword ptr [esi+10h] 
  004447AF  movdqa      xmmword ptr [edx],xmm0 
  004447B3  movdqa      xmmword ptr [edx+10h],xmm1 
  004447B8  movdqa      xmm2,xmmword ptr [esi+20h] 
  004447BD  movdqa      xmm3,xmmword ptr [esi+30h] 
  004447C2  movdqa      xmmword ptr [edx+20h],xmm2 
  004447C7  movdqa      xmmword ptr [edx+30h],xmm3 
  004447CC  movdqa      xmm4,xmmword ptr [esi+40h] 
  004447D1  movdqa      xmm5,xmmword ptr [esi+50h] 
  004447D6  movdqa      xmmword ptr [edx+40h],xmm4 
  004447DB  movdqa      xmmword ptr [edx+50h],xmm5 
  004447E0  movdqa      xmm6,xmmword ptr [esi+60h] 
  004447E5  movdqa      xmm7,xmmword ptr [esi+70h] 
  004447EA  add         esi,80h 
  004447F0  movdqa      xmmword ptr [edx+60h],xmm6 
  004447F5  movdqa      xmmword ptr [edx+70h],xmm7 
  004447FA  add         edx,80h 
  00444800  cmp         ecx,80h 
  00444806  jge         004447A0

and runs at ~1350 MByte/s only.

HOWEVER

memcpy(&dst[0],&src[0],N)

where N is previously declared const size_t N=512*(1<<20); compiles (on 32bit) to a direct call to a

__intel_VEC_memcpy

the bulk of which consists of

  0043FF40  movdqa      xmm0,xmmword ptr [esi] 
  0043FF44  movdqa      xmm1,xmmword ptr [esi+10h] 
  0043FF49  movdqa      xmm2,xmmword ptr [esi+20h] 
  0043FF4E  movdqa      xmm3,xmmword ptr [esi+30h] 
  0043FF53  movntdq     xmmword ptr [edi],xmm0 
  0043FF57  movntdq     xmmword ptr [edi+10h],xmm1 
  0043FF5C  movntdq     xmmword ptr [edi+20h],xmm2 
  0043FF61  movntdq     xmmword ptr [edi+30h],xmm3 
  0043FF66  movdqa      xmm4,xmmword ptr [esi+40h] 
  0043FF6B  movdqa      xmm5,xmmword ptr [esi+50h] 
  0043FF70  movdqa      xmm6,xmmword ptr [esi+60h] 
  0043FF75  movdqa      xmm7,xmmword ptr [esi+70h] 
  0043FF7A  movntdq     xmmword ptr [edi+40h],xmm4 
  0043FF7F  movntdq     xmmword ptr [edi+50h],xmm5 
  0043FF84  movntdq     xmmword ptr [edi+60h],xmm6 
  0043FF89  movntdq     xmmword ptr [edi+70h],xmm7 
  0043FF8E  lea         esi,[esi+80h] 
  0043FF94  lea         edi,[edi+80h] 
  0043FF9A  dec         ecx  
  0043FF9B  jne         ___intel_VEC_memcpy+244h (43FF40h)

and runs at ~2100 MByte/s (proving 32-bit isn't somehow bandwidth limited).

I withdraw my claim that my own memcpy-like SSE code suffers from a similar ~1300 MByte/s limit in 32-bit builds; I now don't have any problems getting >2 GByte/s on 32 or 64 bit; the trick (as the above results hint) is to use non-temporal ("streaming") stores (e.g. the _mm_stream_ps intrinsic).
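
For illustration, a stripped-down sketch of such a streaming copy loop (this assumes 16-byte-aligned pointers and a float count that's a multiple of 4; real code needs to handle the edges and should only stream for copies much larger than cache):

  #include <xmmintrin.h>   // SSE: _mm_load_ps, _mm_stream_ps, _mm_sfence
  #include <cstddef>

  // Non-temporal ("streaming") copy: the stores bypass the cache, so the
  // destination makes a one-way trip to memory instead of a round trip.
  void stream_copy(float* dst, const float* src, std::size_t count)
  {
      for (std::size_t i = 0; i < count; i += 4)
          _mm_stream_ps(dst + i, _mm_load_ps(src + i));
      _mm_sfence();   // make the streaming stores globally visible before returning
  }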

It seems a bit strange that the 32-bit "dst.size()" memcpy doesn't eventually call the faster "movnt" version (if you step into memcpy there is the most incredible amount of CPUID checking and heuristic logic, e.g. comparing the number of bytes to be copied with the cache size, before it goes anywhere near your actual data), but at least I understand the observed behaviour now (and it's not SysWow64 or H/W related).

timday
A: 

Hi timday,

Thanks for the positive feedback! I think I can partly explain what's going on here.

Using non-temporal stores for memcpy is definitely the fastest if you're only timing the memcpy call.

On the other hand, if you're benchmarking an application, the movdqa stores have the benefit that they leave the destination memory in cache. Or at least the part of it that fits into cache.

So if you're designing a runtime library and you can assume that the application that called memcpy is going to use the destination buffer immediately after the memcpy call, then you'll want to provide the movdqa version. This effectively optimizes out the trip from memory back into the CPU that would follow the movntdq version, and all of the instructions following the call will run faster.

But on the other hand, if the destination buffer is large compared to the processor's cache, that optimization doesn't work and the movntdq version would give you faster application benchmarks.

So the idea is that memcpy has multiple versions under the hood: when the destination buffer is small compared to the processor's cache, use movdqa; when it is large compared to the cache, use movntdq. It sounds like this is what's happening in the 32-bit library.
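
In outline the dispatch would look something like this; the helper names and the 1 MB threshold here are just placeholders of mine, not what the Intel CRT actually does:

  #include <cstddef>

  // Hypothetical helpers standing in for the two inner loops shown above:
  void* cached_copy(void* dst, const void* src, std::size_t n);     // movdqa stores
  void* streaming_copy(void* dst, const void* src, std::size_t n);  // movntdq stores

  void* smart_memcpy(void* dst, const void* src, std::size_t n)
  {
      const std::size_t kCacheThreshold = 1u << 20;  // placeholder, not the real heuristic
      return (n < kCacheThreshold) ? cached_copy(dst, src, n)
                                   : streaming_copy(dst, src, n);
  }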

Of course, none of this has anything to do with the differences between 32-bit and 64-bit.

My conjecture is that the 64-bit library just isn't as mature. The developers simply haven't gotten around to providing both routines in that version of the library yet.

Die in Sente
Yup, the whole question of what state you want the cache in post-copy is an interesting one. I was using >256 MByte copies. If I copy something more comparable to the cache size, I see all the memcpy implementations I've looked at sensibly revert from streaming (non-temporal) stores to conventional moves.
timday