views: 638
answers: 3
Using memcpy() when the source and destination overlap results in undefined behaviour - in those cases only memmove() can be used.

But what if I know for sure that the buffers don't overlap - is there a reason to use specifically memcpy() or specifically memmove()? Which should I use, and why?

+20  A: 

memcpy() doesn't have any special handling for overlapping buffers, so it can skip the checks that memmove() needs, which makes it faster.

Also, on some architectures memcpy() can benefit from CPU instructions for moving blocks of memory - something that memmove() cannot use.
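To see where the difference comes from, here is a deliberately naive, byte-at-a-time sketch of both functions (real library versions are heavily optimized; the direction check in memmove() is the essential extra work being discussed):

```c
#include <stddef.h>

/* Naive memcpy: always copies forward; only correct for non-overlapping buffers. */
void *my_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Naive memmove: chooses a copy direction so overlapping buffers still work.
 * (Comparing unrelated pointers like this is informal; real implementations
 * do it more carefully, often in assembly.) */
void *my_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (d < s) {
        while (n--)          /* destination below source: copy forward */
            *d++ = *s++;
    } else {
        d += n;
        s += n;
        while (n--)          /* destination above source: copy backward */
            *--d = *--s;
    }
    return dst;
}
```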

qrdl
your edit captured what I wanted to say, +1.
roe
+1, for mentioning CPU instructions.
MAK
Even on a RISC architecture, there are often block-move operations from which memcpy() can benefit. PowerPC has VMX, for example.
Crashworks
Nah, decent code generators produce rep movs after checking for no overlap. MSVC does.
Hans Passant
@nobugz You can't always determine at compile time whether the buffers overlap. Or did you mean checking at run time?
qrdl
@Crashworks Interesting, didn't know that. Seems I had experience with more RISC-y RISCs, which have only load/store instructions to access memory.
qrdl
@qrdl: RISC doesn't have rep movs, but many RISC architectures have vector registers that are wider than the scalar core registers, and have a correspondingly wider path to/from memory.
Stephen Canon
Obviously I was wrong about RISC architectures - I removed that bit.
qrdl
A few of the newer optimizations for memcpy() on modern CPUs are cache-based: using a temporary cache area for reads, cache prefetching from the source, cache-zeroing on the destination (dcbz on PPC), and so on. Some CPUs also have "DMA-like" extensions for fully asynchronous copying. A good implementation of memmove() will use the same optimized code as memcpy(), but it does require a check for overlap first, so it will be slightly slower even if you already know no overlap exists.
Adisak
+2  A: 

If you're interested in which will perform better, you need to test it on the target platform. Nothing in the standard mandates how the functions are implemented and, while it may seem logical that a non-checking memcpy would be faster, this is by no means a certainty.

It's quite possible, though unlikely, that the person who wrote memmove for your particular compiler was a certified genius while the poor soul who got the job of writing memcpy was the village idiot :-)

Although, in reality, I find it hard to imagine that memmove could be faster than memcpy, I don't discount the possibility. Measure, don't guess.
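If you do want to measure, a rough micro-benchmark along these lines is enough for a first look (buffer size and iteration count are arbitrary, and clock() granularity is coarse - treat the numbers as indicative only):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE   (16 * 1024 * 1024)   /* 16 MiB, arbitrary */
#define ITERATIONS 100

int main(void)
{
    unsigned char *src = malloc(BUF_SIZE);
    unsigned char *dst = malloc(BUF_SIZE);
    if (!src || !dst)
        return 1;
    memset(src, 0xAB, BUF_SIZE);

    clock_t start = clock();
    for (int i = 0; i < ITERATIONS; i++)
        memcpy(dst, src, BUF_SIZE);
    double t_cpy = (double)(clock() - start) / CLOCKS_PER_SEC;

    start = clock();
    for (int i = 0; i < ITERATIONS; i++)
        memmove(dst, src, BUF_SIZE);
    double t_move = (double)(clock() - start) / CLOCKS_PER_SEC;

    /* Use the destination so the compiler can't discard the copies. */
    printf("last byte: %u\n", dst[BUF_SIZE - 1]);
    printf("memcpy:  %.3f s\n", t_cpy);
    printf("memmove: %.3f s\n", t_move);

    free(src);
    free(dst);
    return 0;
}
```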

paxdiablo
`memcpy` has the restrict qualifier on its arguments, not `memmove`. (It codifies precisely the fact that the buffers don't overlap).
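For reference, the C99 declarations in `<string.h>` show exactly that difference:

```c
void *memcpy(void * restrict s1, const void * restrict s2, size_t n);
void *memmove(void *s1, const void *s2, size_t n);
```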
Stephen Canon
D'Oh! You're right, of course, @StephenC, I got them the wrong way around. Removed that twaddle from my answer :-)
paxdiablo
+8  A: 

Assuming a sane library implementor, memcpy will always be at least as fast as memmove. However, on most platforms the difference will be minimal, and on many platforms memcpy is just an alias for memmove to support legacy code that (incorrectly) calls memcpy on overlapping buffers.

Both memcpy and memmove should be written to take advantage of the fastest loads and stores available on the platform.

To answer your question: you should use the one that is semantically correct. If you can guarantee that the buffers do not overlap, you should use memcpy. If you cannot guarantee that the buffers don't overlap, you should use memmove.
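For example: copying into a separate buffer is a memcpy() job, while shifting elements within the same array is exactly the overlapping case memmove() exists for (a small illustrative snippet):

```c
#include <string.h>

void example(void)
{
    int a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    int b[8];

    /* Distinct objects - guaranteed not to overlap, so memcpy is fine. */
    memcpy(b, a, sizeof a);

    /* "Delete" a[2] by shifting the tail left - source and destination
     * overlap, so memmove is required. */
    memmove(&a[2], &a[3], 5 * sizeof a[0]);
}
```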

Stephen Canon
+1. I especially like the "assuming sane" counterpoint to my own answer :-)
paxdiablo
Nitpick: `memcpy` and `memmove` should be written to take advantage of the fastest `unaligned` loads and stores available on the platform. If you know your buffers are aligned properly, you can often get much better performance using things like MMX, which copy much larger data units at a time.
Adam Rosenfield
@Adam: Generally speaking one can arrange to use aligned loads and stores in memcopy by first copying some smaller units to achieve appropriate alignment. If the buffers do not have similar alignment, it will be necessary to apply some shift or permute before storing, but this is faster than using unaligned memory accesses on many architectures.
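A rough sketch of that strategy - copy bytes until the destination is word-aligned, then copy word-sized chunks, then finish the tail. This is purely illustrative (the function name and structure are made up); real implementations use SIMD or hand-written assembly and handle mismatched source/destination alignment with shifts or permutes:

```c
#include <stddef.h>
#include <stdint.h>

void *copy_aligned_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: copy bytes until the destination is word-aligned. */
    while (n && ((uintptr_t)d % sizeof(size_t)) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Bulk: word-at-a-time copies, only if the source is now aligned too
     * (i.e. both pointers had the same misalignment). The casts here skirt
     * strict aliasing; a real implementation does this in assembly or with
     * compiler builtins. */
    if (((uintptr_t)s % sizeof(size_t)) == 0) {
        while (n >= sizeof(size_t)) {
            *(size_t *)(void *)d = *(const size_t *)(const void *)s;
            d += sizeof(size_t);
            s += sizeof(size_t);
            n -= sizeof(size_t);
        }
    }

    /* Tail: whatever bytes remain. */
    while (n--)
        *d++ = *s++;

    return dst;
}
```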
Stephen Canon