Is there a version of memset() which sets a value that is larger than 1 byte (char)? For example, let's say we have a memset32() function, so using it we can do the following:

int32_t array[10];
memset32(array, 0xDEADBEEF, sizeof(array));

This will set the value 0xDEADBEEF in all the elements of array. Currently it seems to me this can only be done with a loop.

Specifically, I am interested in a 64 bit version of memset(). Know anything like that?

+2  A: 

wmemset(3) is the wide (16-bit) version of memset. I think that's the closest you're going to get in C, without a loop.

Pi
sweeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeet
bobobobo
-1 for 16-bit. It's `wchar_t`, which is 32-bit on any implementation that supports Unicode properly. It's only 16-bit on Windows, which ignores the C standard and stores UTF-16 in `wchar_t`.
R..
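A minimal sketch of what using wmemset() looks like, assuming a hosted C implementation that provides <wchar.h>. The helper names fill_wide and wide_element_size are hypothetical and only serve the illustration. Note that the third argument counts wchar_t elements rather than bytes, and that the width of each element is whatever sizeof(wchar_t) is on the implementation, so this is not a portable 32-bit or 64-bit fill.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: fill `count` wchar_t elements (not bytes) with `value`. */
static void fill_wide(wchar_t *dst, wchar_t value, size_t count)
{
    wmemset(dst, value, count);
}

/* Width in bytes of one filled element; chosen by the implementation. */
static size_t wide_element_size(void)
{
    return sizeof(wchar_t);
}
```

Checking wide_element_size() at build time or run time is one way to find out whether wmemset() happens to match the element width you need on a given platform.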
+8  A: 
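#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill _size bytes at _dest with the repeating 8-byte pattern _value. */
void memset64( void *_dest, uint64_t _value, size_t _size )
{
  size_t i;
  /* Whole 8-byte chunks; memcpy() is safe even if _dest is unaligned. */
  for( i = 0; i < (_size & ~(size_t)7); i+=8 )
  {
    memcpy( ((char*)_dest) + i, &_value, 8 );
  }
  /* Remaining 0 to 7 bytes: continue the pattern one byte at a time. */
  for( ; i < _size; i++ )
  {
    ((char*)_dest)[i] = ((char*)&_value)[i&7];
  }
}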

(Explanation, as requested in the comments: when you assign through a pointer, the compiler assumes that the pointer is aligned to the type's natural alignment; for uint64_t, that is 8 bytes. memcpy() makes no such assumption. On some hardware unaligned accesses are impossible, so assignment is not a suitable solution unless you know unaligned accesses work on the hardware with small or no penalty, or know that they will never occur, or both. The compiler will replace small memcpy()s and memset()s with more suitable code, so it is not as horrible as it looks; but if you do know enough to guarantee assignment will always work and your profiler tells you it is faster, you can replace the memcpy with an assignment. The second for() loop is present in case the amount of memory to be filled is not a multiple of 64 bits. If you know it always will be, you can simply drop that loop.)
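To make the role of the second loop concrete, here is a small sketch that restates the same approach under a hypothetical name, fill64_bytes, so that it can be compiled on its own. With a size of 11 bytes, the first loop writes the value once (8 bytes) and the second loop writes the first 3 bytes of the value in memory order, so the pattern simply continues and is cut off at the end of the buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical restatement of the memset64() approach from this answer. */
static void fill64_bytes(void *dest, uint64_t value, size_t size)
{
    unsigned char *d = dest;
    const unsigned char *v = (const unsigned char *)&value;
    size_t i;

    /* Whole 8-byte chunks; memcpy() makes no alignment assumption. */
    for (i = 0; i + 8 <= size; i += 8)
        memcpy(d + i, &value, 8);

    /* Tail of 0 to 7 bytes: keep repeating the value's bytes in order. */
    for (; i < size; i++)
        d[i] = v[i & 7];
}
```

Because the chunk copy goes through memcpy(), the destination may start at any byte address, for example one byte into a char buffer.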

moonshadow
This implementation is more than I bargained for with the question :) Thanks! It would have been nice if you explained the implementation. For example, I can't understand why use a function call to memcpy() instead of an assignment.
gnobal
+2  A: 

Check your OS documentation for a local version, then consider just using the loop.

The compiler probably knows more about optimizing memory access on any particular architecture than you do, so let it do the work.

Wrap it up as a library and compile it with all the speed improving optimizations the compiler allows.

dmckee
+2  A: 

Write your own; it's trivial even in asm.

Kevin Conner
example? Do you have a win32 assembly snippet?
bobobobo
A: 

You should really let the compiler optimize this for you as someone else suggested. In most cases that loop will be negligible.

But if this is some special situation and you don't mind being platform specific, and really need to get rid of the loop, you can do this in an assembly block.

//pseudo code
asm
{
    rep stosq ...
}

You can probably google stosq assembly command for the specifics. It shouldn't be more than a few lines of code.
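For illustration only, here is one way the idea might look with GCC-style extended inline assembly on x86-64. This is a sketch under stated assumptions, not the snippet the answer had in mind: the helper name fill64_stosq is hypothetical, the asm path is only compiled when the compiler defines __GNUC__ and __x86_64__ (GCC and Clang do), and every other target falls back to a plain loop. It also assumes the direction flag is clear, which the x86-64 System V ABI requires at function entry.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store `value` into `count` consecutive 64-bit slots. */
static void fill64_stosq(uint64_t *dst, uint64_t value, size_t count)
{
#if defined(__GNUC__) && defined(__x86_64__)
    /* rep stosq stores RAX at [RDI], RCX times, advancing RDI by 8 each time.
       "+D" and "+c" tell the compiler that RDI and RCX are read and modified;
       the "memory" clobber tells it that memory behind dst has changed. */
    __asm__ volatile("rep stosq"
                     : "+D"(dst), "+c"(count)
                     : "a"(value)
                     : "memory");
#else
    /* Portable fallback for other compilers and CPUs. */
    for (size_t i = 0; i < count; i++)
        dst[i] = value;
#endif
}
```

Whether this beats the loop the compiler generates on its own depends on the CPU and the buffer size, so it is worth measuring before keeping it.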

kervin
+3  A: 

There's no standard library function afaik. So if you're writing portable code, you're looking at a loop.

If you're writing non-portable code then check your compiler/platform documentation, but don't hold your breath because it's rare to get much help here. Maybe someone else will chip in with examples of platforms which do provide something.

The way you'd write your own depends on whether you can define in the API that the caller guarantees the dst pointer will be sufficiently aligned for 64-bit writes on your platform (or platforms if portable). On any platform that has a 64-bit integer type at all, malloc at least will return suitably-aligned pointers.

If you have to cope with non-alignment, then you need something like moonshadow's answer. The compiler may inline/unroll that memcpy with a size of 8 (and use 32- or 64-bit unaligned write ops if they exist), so the code should be pretty nippy, but my guess is it probably won't special-case the whole function for the destination being aligned. I'd love to be corrected, but fear I won't be.

So if you know that the caller will always give you a dst with sufficient alignment for your architecture, and a length which is a multiple of 8 bytes, then do a simple loop writing a uint64_t (or whatever the 64-bit int is in your compiler) and you'll probably (no promises) end up with faster code. You'll certainly have shorter code.
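A minimal sketch of that simple loop, assuming C11 (for _Alignof) and a hypothetical function name fill64_aligned. The two assertions only document the contract described here, namely an aligned destination and a length that is a multiple of 8 bytes; they are not part of the technique and vanish when NDEBUG is defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper. Contract: dst is suitably aligned for uint64_t
   and size_bytes is a multiple of sizeof(uint64_t). */
static void fill64_aligned(void *dst, uint64_t value, size_t size_bytes)
{
    assert((uintptr_t)dst % _Alignof(uint64_t) == 0);
    assert(size_bytes % sizeof(uint64_t) == 0);

    uint64_t *p = dst;
    size_t count = size_bytes / sizeof(uint64_t);

    /* Plain 64-bit stores; the compiler is free to unroll or vectorise. */
    for (size_t i = 0; i < count; i++)
        p[i] = value;
}
```

For an array declared as uint64_t array[10], or a block returned by malloc, a call such as fill64_aligned(array, 0xDEADBEEFDEADBEEF, sizeof(array)) satisfies both conditions.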

Whatever the case, if you do care about performance then profile it. If it's not fast enough try again with more optimisation. If it's still not fast enough, ask a question about an asm version for the CPU(s) on which it's not fast enough. memcpy/memset can get massive performance increases from per-platform optimisation.

Steve Jessop
Thanks for the detailed answer
gnobal