I would like to copy a relatively short sequence of memory (less than 1 KB, typically 2-200 bytes) in a time-critical function. The best code for this on the CPU side seems to be rep movsd. However, I somehow cannot make my compiler generate this code. I hoped (and I vaguely remember seeing so) that using memcpy would do this via compiler built-in intrinsics, but based on disassembly and debugging it seems the compiler is calling the memcpy/memmove library implementation instead. I also hoped the compiler might be smart enough to recognize the following loop and use rep movsd on its own, but it seems it does not.

char *dst;
const char *src;
// ...
for (int r=size; --r>=0; ) *dst++ = *src++;

Is there some way to make the Visual Studio compiler generate a rep movsd sequence, other than using inline assembly?

+2  A: 

Are you running an optimised build? It won't use an intrinsic unless optimisation is on. It's also worth noting that it will probably use a better copy loop than rep movsd. It should try to use MMX, at the least, to copy 64 bits at a time. In fact, 6 or 7 years back I wrote an MMX-optimised copy loop for doing this sort of thing. Unfortunately the compiler's intrinsic memcpy outperformed my MMX copy by about 1%. That really taught me not to make assumptions about what the compiler is doing.

Goz
What I see is the compiler calling the generic memmove function. This function has great throughput (using aligned copying, MMX and even SSE as needed); however, its setup overhead is way too high, which makes the function unsuitable for copying only a few bytes.
Suma
The thing is, if you are only copying very few bytes, the compiler will even optimise the memcpy away completely. For example, if you memcpy the 4 bytes of a float into an int (thereby avoiding any potential aliasing), then GCC and the MSVC compiler will remove the memcpy completely (I've checked this). There must be something you are doing that prevents the memcpy from being removed. Also, there is nothing stopping you writing a bit of assembler to do the movsd, but I suspect you'll find it's not faster than calling memcpy.
Goz
... you are correct, but the problem is it is optimized away only when the compiler knows it is only a few bytes (i.e. the size is a small compile-time constant). When it does not (the size is not known at compile time), it assumes the most likely case of a large block and calls the library implementation.
Suma
Hmm, that does make sense. How could the compiler optimise that away? As I've said already, write some assembler and compare speeds. Either that, or use a switch that selects the simplest implementations of memcpy so that they can be optimised out. The switch really won't be instruction-cache friendly, though; so much so that you'd be better off just calling memcpy.
Goz
A: 

Note that in order to use movsd, src must point to memory aligned to a 32-bit boundary and its length must be a multiple of 4 bytes.

If it is, why does your code use char * instead of int * or something? If it's not, your question is moot.

If you change char * to int *, you might get a better result from std::copy.

Edit: have you measured that the copying is the bottleneck?

avakar
movsb would do as well. Note: while you are correct about size, movsd does not require DWORD alignment of target or source.
Suma
It does not require the alignment, but doing unaligned `movsd` won't be very fast.
avakar
if you're at all concerned about performance, then yes, it requires aligned data. ;)
jalf
A: 

Have you timed memcpy? On recent versions of Visual Studio, the memcpy implementation uses SSE2... which should be faster than rep movsd. If the block you're copying is 1 KB, then it's not really a problem that the compiler isn't using an intrinsic since the time for the function call will be negligible compared to the time for the copy.

Martin B
The block is under 1 KB. Sometimes only a few bytes, sometimes 10, sometimes ~200 B.
Suma
Ah, OK. How about deciding at runtime, based on the size of the block to copy, whether to call memcpy? Say, if size > 32 (or some other value determined to be optimal), call memcpy; otherwise do your own (possibly assembly-optimized) copy. You could wrap this logic in an inline function `mymemcpy()`.
Martin B
+2  A: 

Several questions come to mind.

First, how do you know movsd would be faster? Have you looked up its latency/throughput? The x86 architecture is full of crufty old instructions that should not be used because they're just not very efficient on modern CPUs.

Second, what happens if you use std::copy instead of memcpy? std::copy is potentially faster, as it can be specialized at compile-time for the specific data type.

And third, have you enabled intrinsic functions under project properties -> C/C++ -> Optimization?

Of course I assume other optimizations are enabled as well.

jalf
A: 

Use memcpy. This problem has already been solved.

FYI, rep movsd is not always the best: rep movsb can be faster in some circumstances, and with SSE and the like the best is movntdq [edi], xmm0 (a non-temporal store). You can even optimize further for large amounts of memory by exploiting page locality: move the data to a buffer first, then move it to your destination.

Edouard A.
I am not optimizing for large amounts of memory. I am optimizing for short copied sequences, and I have found memcpy's setup overhead to be unacceptably high. Even a simple for loop, as in my question, performs better in such a scenario.
Suma
This is memcpy in the VS 2005 source code: `while (count--) { *(char *)dst = *(char *)src; dst = (char *)dst + 1; src = (char *)src + 1; }` Which VS are you using? Which optimizations?
Edouard A.
I think the problem is one abstraction layer above your memcpy. The problem is not that memcpy is slow for small buffers; it's that you're doing many memcpys on small buffers in the first place. Do you get the performance you want with a hand-written rep movsb?
Edouard A.
VS 2005. Are you sure about the source? In my case I can see memcpy.asm being called, with the same source implementing both memcpy and memmove.
Suma
+1  A: 

Using memcpy with a constant size

What I have found meanwhile:

The compiler will use the intrinsic when the copied block size is known at compile time. When it is not, it calls the library implementation. When the size is known, the generated code is very nice and is selected based on the size: it may be a single mov, or a movsd, or a movsd followed by a movsb, as needed.

It seems that if I really want to always use movsb or movsd, even with a "dynamic" size, I will have to use inline assembly or a special intrinsic (see below). I know the size is "quite short", but the compiler does not know it and I cannot communicate this to it; I have even tried to use __assume(size<16), but it is not enough.

Demo code; compile with -Ob1 (expansion for functions marked as inline only):

  #include <memory.h>

  void MemCpyTest(void *tgt, const void *src, size_t size)
  {
    memcpy(tgt,src,size);
  }

  template <int size>
  void MemCpyTestT(void *tgt, const void *src)
  {
    memcpy(tgt,src,size);
  }

  int main ( int argc, char **argv )
  {
    int src;
    int dst;
    MemCpyTest(&dst,&src,sizeof(dst));
    MemCpyTestT<sizeof(dst)>(&dst,&src);
    return 0;
  }

Specialized intrinsics

I have recently found there is a very simple and natural way to make the Visual Studio compiler copy characters using movsd: intrinsics. The following intrinsics may come in handy: __movsb, __movsw and __movsd (declared in <intrin.h>).

Suma
Then your best bet is to write some simple assembler. It won't be hard. Just remember to profile it against memcpy to make sure you ARE actually getting a win, performance-wise.
Goz
How about using fixed-size blocks in your allocations? Always allocate in blocks of 32 or 64 bytes and copy the entire thing. I would bet that the extra 30-some bytes in a copy are hardly noticeable.
Zan Lynx