views:

252

answers:

8

Is there a function (SSEx intrinsics is OK) which will fill the memory with a specified int32_t value? For instance, when this value is equal to 0xAABBCC00 the result memory should look like:

AABBCC00AABBCC00AABBCC00AABBCC00AABBCC00
AABBCC00AABBCC00AABBCC00AABBCC00AABBCC00
AABBCC00AABBCC00AABBCC00AABBCC00AABBCC00
AABBCC00AABBCC00AABBCC00AABBCC00AABBCC00
...

I could use std::fill or simple for-loop, but it is not fast enough.


Resizing of a vector performed only once in the beginning of program, this is not an issue. The bottleneck is filling the memory.

Simplified code:

struct X
{
  typedef std::vector<int32_t> int_vec_t;
  int_vec_t buffer;

  X() : buffer( 5000000 ) { /* some more action */ }
  ~X() { /* some code here */ }

  // the following function is called 25 times per second
  const int_vec_t& process( int32_t background, const SOME_DATA& data );
};

const X::int_vec_t& X::process( int32_t background, const SOME_DATA& data )
{
    // the following one string takes 30% of total time of #process function
    std::fill( buffer.begin(), buffer.end(), background );

    // some processing
    // ...

    return buffer;
}
+6  A: 

This is how I would do it (please excuse the Microsoft-ness of it):

VOID FillInt32(__out PLONG M, __in LONG Fill, __in ULONG Count)
{
    __m128i f;

    // Fix mis-alignment.
    if ((ULONG_PTR)M & 0xf)
    {
        switch ((ULONG_PTR)M & 0xf)
        {
            case 0x4: if (Count >= 1) { *M++ = Fill; Count--; }
            case 0x8: if (Count >= 1) { *M++ = Fill; Count--; }
            case 0xc: if (Count >= 1) { *M++ = Fill; Count--; }
        }
    }

    f.m128i_i32[0] = Fill;
    f.m128i_i32[1] = Fill;
    f.m128i_i32[2] = Fill;
    f.m128i_i32[3] = Fill;

    while (Count >= 4)
    {
        _mm_store_si128((__m128i *)M, f);
        M += 4;
        Count -= 4;
    }

    // Fill remaining LONGs.
    switch (Count & 0x3)
    {
        case 0x3: *M++ = Fill;
        case 0x2: *M++ = Fill;
        case 0x1: *M++ = Fill;
    }
}
wj32
I would be interested in how do this compare performancewise with std::fill.
Alexandre C.
Sorry, I'm a plain C guy. I don't know anything about std, fill or std::fill.
wj32
+5  A: 

I have to ask: Have you definitely profiled std::fill and shown it to be the performance bottleneck? I would guess it to be implemented in a pretty efficient manner, such that the compiler can automatically generate the appropriate instructions (for example -march on gcc).

If it is the bottleneck, it may still be possible to get better benefit from an algorithmic redesign (if possible) to avoid setting so much memory (apparently over and over) such that it doesn't matter anymore which fill mechanism you use.

Mark B
+2  A: 

Have you considered using

vector<int32_t> myVector;
myVector.reserve( sizeIWant );

and then use std::fill? Or perhaps the constructor of a std::vector which takes as an argument the number of items held and the value to initialize them at?

wheaties
This is a really good point. If you're appending to a vector, part of your overhead may be in resizing that vector, which wouldn't happen every time you inserted to the end (vector auto-expands a little larger than you need it, generally) but it would be often enough to incur a performance hit. Use reserve() to pre-allocate a certain length.
Sean Edwards
You can also use an array to be 100% sure resizing is not the issue.
inflagranti
Yeah, originally I thought he was trying to set values in an array he made with malloc(). I didn't even consider that it might be vector resizing slowing him down. :)
Sean Edwards
Resizing of a `vector` performed only once in the beginning of program, this is not an issue. The bottleneck is filling the memory.
Kirill V. Lyadvinsky
Wow, you're in a whole new realm. Is a `malloc` and `memcpy` solution too slow?
wheaties
memcpy won't work, because he effectively needs to copy sequences of bytes, not just fill the whole block with a single byte.
Sean Edwards
You can fill as many bytes of one contiguous piece of memory with a direct copy of another contiguous piece of memory.
wheaties
+1  A: 

Not total sure how you set 4 bytes in a row, but if you want to fill memory with just one byte over an over again, you can use memset.

void * memset ( void * ptr, int value, size_t num );

Fill block of memory Sets the first num bytes of the block of memory pointed by ptr to the specified value (interpreted as an unsigned char).

Link to example

UnixShadow
I don't want to fill memory with one byte. There're four bytes in `int32_t`.
Kirill V. Lyadvinsky
A: 

Assuming you have a limited amount of values in your background parameter (or even better, only on), maybe you should try to allocate a static vector, and simply use memcpy.

const int32_t sBackground = 1234;
static vector <int32_t> sInitalizedBuffer(n, sBackground);

    const X::int_vec_t& X::process( const SOME_DATA& data )
    {
        // the following one string takes 30% of total time of #process function
        std::memcpy( (void*) data[0], (void*) sInitalizedBuffer[0], n * sizeof(sBackground));

        // some processing
        // ...

        return buffer;
    }
Viktor Sehr
A: 

It might be a bit non portable but you could use an overlapping memory copy. Fill the first four bytes with the pattern you want and use memcpy().

int32* p = (int32*) malloc( size );
*p = 1234;
memcpy( p + 4, p, size - 4 );

don't think you can get much faster

Jay
It's not supported.see http://stackoverflow.com/questions/387654/why-is-there-no-z80-like-ldir-functionality-in-c-c-rtl
EvilTeach
It may not be "supported" but it works in vs2008. I can provide source if needed. I also can't find what you're referring to in the linked page.
Jay
Overlapping memcpy's is a well-known bug, and to recommend them is bad advice. memmove is the correct call to use for overlapping regions. "It's not supported but it works" is for engraving on tombstones.
Will Dean
Sorry, you're misinformed. It is not a bug. Some have called it a bug because they don't comprehend the consequences of the code they wrote. There are explicit warnings to use memmove() if this is not the desired behavior. It works *as designed* using the bulk copy operation of the CPU.
Jay
I double checked and this has been changed to 'not defined behavior' since I last read the documentation. If the poster really wants the fastest way to accomplish his task this is it. To be safe he should compile a program using memcpy and extract the assembler source code and use it. That will prevent the current behavior from being deprecated or changed and breaking his program.
Jay
A: 

I just tested std::fill with g++ with full optimizations (SSE etc.. enabled):

#include <algorithm>
#include <inttypes.h>

int32_t a[5000000];

int main(int argc,char *argv[])
{
    std::fill(a,a+5000000,0xAABBCC00);
    return a[3];
}

and the inner loop looked like:

L2:
    movdqa  %xmm0, -16(%eax)
    addl    $16, %eax
    cmpl    %edx, %eax
    jne L2

Looks like 0xAABBCC00 x 4 was loaded into xmm0 and is being moved 16-bytes at a time.

5ound
I'm curious, why is this code which uses comparisons and conditional jumps still faster than `REPNZ STOS` or similar?
Philipp
@Philipp: it copies 16 bytes at a time. Comparisons and conditional jumps aren't necessarily expensive. It depends a lot on the context, what other instructions are being executed.
jalf
+3  A: 

Thanks to everyone for your answers. I've checked wj32's solution , but it shows very similar time as std::fill do. My current solution works 4 times faster (in Visual Studio 2008) than std::fill with help of the function memcpy:

 // fill the first quarter by the usual way
 std::fill(buffer.begin(), buffer.begin() + buffer.size()/4, background);
 // copy the first quarter to the second (very fast)
 memcpy(&buffer[buffer.size()/4], &buffer[0], buffer.size()/4*sizeof(background));
 // copy the first half to the second (very fast)
 memcpy(&buffer[buffer.size()/2], &buffer[0], buffer.size()/2*sizeof(background));

In the production code one needs to add check if buffer.size() is divisible by 4 and add appropriate handling for that.

Kirill V. Lyadvinsky