Hi,

I learned that memset(ptr, 0, nbytes) is really fast, but is there a faster way (at least on x86)?

I assume that memset uses mov, however when zeroing memory most compilers use xor as it's faster, correct?

Also, I asked someone who knows more about assembly than I do to look at the stdlib, and he told me that on x86 memset is not taking full advantage of the 32-bit-wide registers. However, I was very tired at the time, so I'm not quite sure I understood him correctly.

+4  A: 

memset is generally designed to be very very fast general-purpose setting/zeroing code. It handles all cases with different sizes and alignments, which affect the kinds of instructions you can use to do your work. Depending on what system you're on (and what vendor your stdlib comes from), the underlying implementation might be in assembler specific to that architecture to take advantage of whatever its native properties are. It might also have internal special cases to handle the case of zeroing (versus setting some other value).

That said, if you have very specific, very performance critical memory zeroing to do, it's certainly possible that you could beat a specific memset implementation by doing it yourself. memset and its friends in the standard library are always fun targets for one-upmanship programming. :)

quixoto
Also: memset could in theory have a special case for 0 which is selected at compile-time (either by inlining or as an intrinsic operation) when that argument is a literal. Don't know whether anyone does or not.
Steve Jessop
@Steve Jessop: Interesting idea (esp that it could be compile-time). I remember reading someone's maverick implementation of memset once that had special cases for just about everything you'd actually use memset for.
quixoto
gcc typically uses an inline builtin implementation of `memset()`. Funnily enough, I remember reading about a buggy implementation of `memset()` that always set the value to 0 - and this wasn't noticed for *years*, because apparently the vast majority of time `memset()` is used to set to zero!
caf
+1  A: 

If I remember correctly (from a couple of years ago), one of the senior developers was talking about a fast way to bzero() on PowerPC (specs said we needed to zero almost all the memory on power up). It might not translate well (if at all) to x86, but it could be worth exploring.

The idea was to load a data cache line, clear that data cache line, and then write the cleared data cache line back to memory.

For what it is worth, I hope it helps.

Sparky
+3  A: 

x86 covers a rather broad range of devices.

For a totally generic x86 target, an assembly block with "rep stosd" could blast out zeros to memory 32 bits at a time. Try to make sure the bulk of this work is DWORD aligned.
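A minimal sketch of the string-store approach, assuming GCC or Clang inline assembly on x86 or x86-64 (where the instruction is spelled "rep stosl" in AT&T syntax). The name zero_dwords is a hypothetical helper, not a library function, and on any other compiler or architecture the sketch simply falls back to memset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: zero `count` 32-bit words starting at dst. */
static void zero_dwords(uint32_t *dst, size_t count)
{
    /* The string store works best on DWORD-aligned destinations. */
    assert(((uintptr_t)dst % sizeof *dst) == 0);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* rep stosl: store EAX at [EDI/RDI], advance by 4, repeat ECX/RCX times.
     * The calling convention guarantees the direction flag is clear here. */
    __asm__ volatile("rep stosl"
                     : "+D"(dst), "+c"(count)
                     : "a"(0)
                     : "memory");
#else
    /* Assumption: no x86 inline assembly available, so use the library. */
    memset(dst, 0, count * sizeof *dst);
#endif
}
```

Whether this beats the library memset depends heavily on the CPU generation, so it is worth timing both.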

For chips with MMX, an assembly loop with movq could hit 64 bits at a time.

You might be able to get a C/C++ compiler to use a 64-bit write with a pointer to a long long or __m64. The target must be 8-byte aligned for the best performance.

For chips with SSE, movaps is fast, but only if the address is 16-byte aligned, so use stosb until aligned, and then complete your clear with a loop of movaps.
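The align-then-blast pattern can be sketched in C with compiler intrinsics instead of hand-written assembly. This assumes a compiler that defines __SSE2__ and ships <emmintrin.h> (GCC and Clang do on x86-64); zero_sse is a hypothetical name, and the integer store _mm_store_si128 is used here in place of movaps because it has the same 16-byte alignment requirement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: zero n bytes at dst, using aligned 16-byte stores
 * for the bulk of the range when SSE2 is available. */
static void zero_sse(void *dst, size_t n)
{
    unsigned char *p = dst;
#if defined(__SSE2__)
    /* Head: single bytes until p reaches a 16-byte boundary. */
    while (n > 0 && ((uintptr_t)p & 15) != 0) {
        *p++ = 0;
        n--;
    }
    assert(n == 0 || ((uintptr_t)p & 15) == 0);

    /* Bulk: one aligned 16-byte store per iteration. */
    const __m128i zero = _mm_setzero_si128();
    while (n >= 16) {
        _mm_store_si128((__m128i *)p, zero);
        p += 16;
        n -= 16;
    }
#endif
    /* Tail (fewer than 16 bytes), or everything when SSE2 is missing. */
    memset(p, 0, n);
}
```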

Win32 has "ZeroMemory()", but I forget if that's a macro for memset or an actual 'good' implementation.

Tim
+2  A: 

Nowadays your compiler should do all the work for you. At least as far as I know, gcc is very efficient at optimizing calls to memset away (better check the assembler output, though).

Then also, avoid memset if you don't have to:

  • use calloc for heap memory
  • use proper initialization (... = { 0 }) for stack memory

And for really large chunks use mmap if you have it. This just gets zero initialized memory from the system "for free".
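The three alternatives above might look like this in practice. It is only a sketch: struct sample and the helper names are made up for illustration, and the mmap part assumes a POSIX-like system that provides MAP_ANONYMOUS or MAP_ANON (otherwise it falls back to calloc).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#if defined(__unix__) || defined(__APPLE__)
#include <sys/mman.h>
#endif

#if defined(MAP_ANONYMOUS)
#define ZERO_MAP_FLAGS (MAP_PRIVATE | MAP_ANONYMOUS)
#elif defined(MAP_ANON)
#define ZERO_MAP_FLAGS (MAP_PRIVATE | MAP_ANON)
#endif

struct sample { int id; double weight; char name[16]; }; /* hypothetical */

/* Stack memory: = { 0 } zero-initializes every member. */
static struct sample make_sample(void)
{
    struct sample s = { 0 };
    return s;
}

/* Heap memory: calloc hands back storage that is already zeroed. */
static int *make_counters(size_t count)
{
    return calloc(count, sizeof(int));
}

/* Large chunks: an anonymous private mapping is zero-filled by the kernel. */
static void *map_zeroed(size_t nbytes)
{
    assert(nbytes > 0); /* mmap rejects a zero-length mapping */
#ifdef ZERO_MAP_FLAGS
    void *p = mmap(NULL, nbytes, PROT_READ | PROT_WRITE, ZERO_MAP_FLAGS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
#else
    return calloc(1, nbytes); /* assumption: no anonymous mmap here */
#endif
}

/* Release memory obtained from map_zeroed. */
static void unmap_zeroed(void *p, size_t nbytes)
{
#ifdef ZERO_MAP_FLAGS
    munmap(p, nbytes);
#else
    (void)nbytes;
    free(p);
#endif
}
```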

Jens Gustedt
A: 

Also see the question "Strange assembly from array 0-initialization" for a comparison of memset and = { 0 }.

Johann Gerell
+1  A: 

The memset function is designed to be flexible and simple, even at the expense of speed. In many implementations, it is a simple while loop that copies the specified value one byte at a time over the given number of bytes. If you want a faster memset (or memcpy, memmove, etc.), it is almost always possible to code one up yourself.

The simplest customization would be to do single-byte "set" operations until the destination address is 32- or 64-bit aligned (whatever matches your chip's architecture) and then start copying a full CPU register at a time. You may have to do a couple of single-byte "set" operations at the end if your range doesn't end on an aligned address.
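A rough sketch of that head/bulk/tail structure in portable C, using uintptr_t as a stand-in for "one CPU register". The name zero_words is hypothetical. One assumption to flag: storing through a uintptr_t pointer is only well defined when the buffer has no other declared type (for example memory from malloc), so production implementations usually rely on compiler-specific attributes or assembly instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: zero n bytes at dst, a register-sized word at a
 * time once the destination is aligned. */
static void zero_words(void *dst, size_t n)
{
    unsigned char *p = dst;

    /* Head: single-byte stores until p is word aligned. */
    while (n > 0 && ((uintptr_t)p % sizeof(uintptr_t)) != 0) {
        *p++ = 0;
        n--;
    }
    assert(n == 0 || ((uintptr_t)p % sizeof(uintptr_t)) == 0);

    /* Bulk: one full word per store. */
    uintptr_t *w = (uintptr_t *)p;
    while (n >= sizeof *w) {
        *w++ = 0;
        n -= sizeof *w;
    }

    /* Tail: the last few bytes if the range does not end on a boundary. */
    p = (unsigned char *)w;
    while (n > 0) {
        *p++ = 0;
        n--;
    }
}
```

An optimizing compiler may well recognize these loops and replace them with a call to memset, which is one more reason to inspect the generated assembly.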

Depending on your particular CPU, you might also have some streaming SIMD instructions that can help you out. These will typically work better on aligned addresses, so the above technique for using aligned addresses can be useful here as well.

For zeroing out large sections of memory, you may also see a speed boost by splitting the range into sections and processing each section in parallel (where the number of sections is the same as your number of cores/hardware threads).

Most importantly, there's no way to tell if any of this will help unless you try it. At a minimum, take a look at what your compiler emits for each case. See what other compilers emit for their standard 'memset' as well (their implementation might be more efficient than your compiler's).
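For the "try it" part, a very small timing harness along these lines can be a starting point. It is a sketch under several assumptions: clock() resolution is coarse, the candidates shown (zero_with_memset, zero_bytewise) are hypothetical stand-ins for whatever you actually want to compare, and a serious benchmark would also control for cache state, buffer size, and alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void (*zero_fn)(void *dst, size_t n);

/* Candidate 1: the library routine. */
static void zero_with_memset(void *dst, size_t n)
{
    memset(dst, 0, n);
}

/* Candidate 2: a naive byte loop. volatile keeps the compiler from
 * quietly turning the loop back into a memset call. */
static void zero_bytewise(void *dst, size_t n)
{
    volatile unsigned char *p = dst;
    while (n--)
        *p++ = 0;
}

/* Average CPU seconds per call of fn over reps repetitions. */
static double seconds_per_call(zero_fn fn, void *buf, size_t n, int reps)
{
    assert(reps > 0);
    clock_t start = clock();
    for (int i = 0; i < reps; i++) {
        ((unsigned char *)buf)[0] = 1; /* dirty the buffer between runs */
        fn(buf, n);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

Calling seconds_per_call once per candidate on the same buffer and comparing the results gives a first impression; the numbers only mean something for the sizes and hardware you actually measured.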

bta
A: 

Unless you have specific needs or know that your compiler/stdlib is sucky, stick with memset. It's general-purpose, and should have decent performance in general. Also, compilers might have an easier time optimizing/inlining memset() because they can have intrinsic support for it.

For instance, Visual C++ will often generate inline versions of memcpy/memset that are as small as a call to the library function, thus avoiding push/call/ret overhead. There are further possible optimizations when the size parameter can be evaluated at compile time.

That said, if you have specific needs (where size will always be tiny _or_ huge), you can gain speed boosts by dropping down to assembly level. For instance, using non-temporal (streaming) stores for zeroing huge chunks of memory without polluting your L2 cache.
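On x86, zeroing a huge range without filling the cache with it is usually done with non-temporal (streaming) stores. Here is a sketch using the SSE2 intrinsics _mm_stream_si128 and _mm_sfence rather than raw assembly; zero_streaming is a hypothetical name, the 16-byte alignment and size restrictions are simplifications for the example, and compilers that do not define __SSE2__ get a plain memset instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: zero n bytes at dst with cache-bypassing stores.
 * Simplification: dst must be 16-byte aligned and n a multiple of 16. */
static void zero_streaming(void *dst, size_t n)
{
    assert(((uintptr_t)dst & 15) == 0 && (n & 15) == 0);
#if defined(__SSE2__)
    __m128i *p = dst;
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(p + i, zero); /* movntdq: non-temporal store */
    _mm_sfence(); /* order the streaming stores before later stores */
#else
    memset(dst, 0, n); /* assumption: no SSE2, use the library routine */
#endif
}
```

This tends to pay off only for buffers much larger than the cache; for small or soon-to-be-read buffers, ordinary stores are normally the better choice.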

But it all depends - and for normal stuff, please stick to memset/memcpy :)

snemarch