ansaurus

Question

How to fill memory fast with a `int32_t` value?

Answer 1

+6 A:

This is how I would do it (please excuse the Microsoft-ness of it):

VOID FillInt32(__out PLONG M, __in LONG Fill, __in ULONG Count)
{
    __m128i f;

    // Fix mis-alignment.
    if ((ULONG_PTR)M & 0xf)
    {
        switch ((ULONG_PTR)M & 0xf)
        {
            case 0x4: if (Count >= 1) { *M++ = Fill; Count--; }
            case 0x8: if (Count >= 1) { *M++ = Fill; Count--; }
            case 0xc: if (Count >= 1) { *M++ = Fill; Count--; }
        }
    }

    f.m128i_i32[0] = Fill;
    f.m128i_i32[1] = Fill;
    f.m128i_i32[2] = Fill;
    f.m128i_i32[3] = Fill;

    while (Count >= 4)
    {
        _mm_store_si128((__m128i *)M, f);
        M += 4;
        Count -= 4;
    }

    // Fill remaining LONGs.
    switch (Count & 0x3)
    {
        case 0x3: *M++ = Fill;
        case 0x2: *M++ = Fill;
        case 0x1: *M++ = Fill;
    }
}

wj32 2010-07-09 12:57:21

I would be interested in how do this compare performancewise with std::fill.

Alexandre C. 2010-07-09 13:00:53

Sorry, I'm a plain C guy. I don't know anything about std, fill or std::fill.

wj32 2010-07-09 13:05:00

Answer 2

+5 A:

I have to ask: Have you definitely profiled std::fill and shown it to be the performance bottleneck? I would guess it to be implemented in a pretty efficient manner, such that the compiler can automatically generate the appropriate instructions (for example -march on gcc).

If it is the bottleneck, it may still be possible to get better benefit from an algorithmic redesign (if possible) to avoid setting so much memory (apparently over and over) such that it doesn't matter anymore which fill mechanism you use.

Mark B 2010-07-09 13:43:05

Answer 3

+2 A:

Have you considered using

vector<int32_t> myVector;
myVector.reserve( sizeIWant );

and then use std::fill? Or perhaps the constructor of a std::vector which takes as an argument the number of items held and the value to initialize them at?

wheaties 2010-07-09 13:46:15

This is a really good point. If you're appending to a vector, part of your overhead may be in resizing that vector, which wouldn't happen every time you inserted to the end (vector auto-expands a little larger than you need it, generally) but it would be often enough to incur a performance hit. Use reserve() to pre-allocate a certain length.

Sean Edwards 2010-07-09 13:50:40

You can also use an array to be 100% sure resizing is not the issue.

inflagranti 2010-07-09 13:54:33

Yeah, originally I thought he was trying to set values in an array he made with malloc(). I didn't even consider that it might be vector resizing slowing him down. :)

Sean Edwards 2010-07-09 13:59:32

Resizing of a `vector` performed only once in the beginning of program, this is not an issue. The bottleneck is filling the memory.

Kirill V. Lyadvinsky 2010-07-09 14:02:25

Wow, you're in a whole new realm. Is a `malloc` and `memcpy` solution too slow?

wheaties 2010-07-09 14:09:28

memcpy won't work, because he effectively needs to copy sequences of bytes, not just fill the whole block with a single byte.

Sean Edwards 2010-07-09 14:22:48

You can fill as many bytes of one contiguous piece of memory with a direct copy of another contiguous piece of memory.

wheaties 2010-07-09 17:37:41

Answer 4

+1 A:

Not total sure how you set 4 bytes in a row, but if you want to fill memory with just one byte over an over again, you can use memset.

void * memset ( void * ptr, int value, size_t num );

Fill block of memory Sets the first num bytes of the block of memory pointed by ptr to the specified value (interpreted as an unsigned char).

Link to example

UnixShadow 2010-07-09 14:11:45

I don't want to fill memory with one byte. There're four bytes in `int32_t`.

Kirill V. Lyadvinsky 2010-07-09 14:22:45

Answer 5

A:

Assuming you have a limited amount of values in your background parameter (or even better, only on), maybe you should try to allocate a static vector, and simply use memcpy.

const int32_t sBackground = 1234;
static vector <int32_t> sInitalizedBuffer(n, sBackground);

    const X::int_vec_t& X::process( const SOME_DATA& data )
    {
        // the following one string takes 30% of total time of #process function
        std::memcpy( (void*) data[0], (void*) sInitalizedBuffer[0], n * sizeof(sBackground));

        // some processing
        // ...

        return buffer;
    }

Viktor Sehr 2010-07-09 14:37:22

Answer 6

A:

It might be a bit non portable but you could use an overlapping memory copy. Fill the first four bytes with the pattern you want and use memcpy().

int32* p = (int32*) malloc( size );
*p = 1234;
memcpy( p + 4, p, size - 4 );

don't think you can get much faster

Jay 2010-07-09 14:41:44

It's not supported.see http://stackoverflow.com/questions/387654/why-is-there-no-z80-like-ldir-functionality-in-c-c-rtl

EvilTeach 2010-07-09 15:31:22

It may not be "supported" but it works in vs2008. I can provide source if needed. I also can't find what you're referring to in the linked page.

Jay 2010-07-12 18:17:05

Overlapping memcpy's is a well-known bug, and to recommend them is bad advice. memmove is the correct call to use for overlapping regions. "It's not supported but it works" is for engraving on tombstones.

Will Dean 2010-07-19 07:21:12

Sorry, you're misinformed. It is not a bug. Some have called it a bug because they don't comprehend the consequences of the code they wrote. There are explicit warnings to use memmove() if this is not the desired behavior. It works *as designed* using the bulk copy operation of the CPU.

Jay 2010-07-19 14:35:30

I double checked and this has been changed to 'not defined behavior' since I last read the documentation. If the poster really wants the fastest way to accomplish his task this is it. To be safe he should compile a program using memcpy and extract the assembler source code and use it. That will prevent the current behavior from being deprecated or changed and breaking his program.

Jay 2010-07-19 18:06:19

Answer 7

A:

I just tested std::fill with g++ with full optimizations (SSE etc.. enabled):

#include <algorithm>
#include <inttypes.h>

int32_t a[5000000];

int main(int argc,char *argv[])
{
    std::fill(a,a+5000000,0xAABBCC00);
    return a[3];
}

and the inner loop looked like:

L2:
    movdqa  %xmm0, -16(%eax)
    addl    $16, %eax
    cmpl    %edx, %eax
    jne L2

Looks like 0xAABBCC00 x 4 was loaded into xmm0 and is being moved 16-bytes at a time.

5ound 2010-07-09 15:18:59

I'm curious, why is this code which uses comparisons and conditional jumps still faster than `REPNZ STOS` or similar?

Philipp 2010-07-19 07:44:02

@Philipp: it copies 16 bytes at a time. Comparisons and conditional jumps aren't necessarily expensive. It depends a lot on the context, what other instructions are being executed.

jalf 2010-07-19 09:32:24

Answer 8

+3 A:

Thanks to everyone for your answers. I've checked wj32's solution , but it shows very similar time as std::fill do. My current solution works 4 times faster (in Visual Studio 2008) than std::fill with help of the function memcpy:

 // fill the first quarter by the usual way
 std::fill(buffer.begin(), buffer.begin() + buffer.size()/4, background);
 // copy the first quarter to the second (very fast)
 memcpy(&buffer[buffer.size()/4], &buffer[0], buffer.size()/4*sizeof(background));
 // copy the first half to the second (very fast)
 memcpy(&buffer[buffer.size()/2], &buffer[0], buffer.size()/2*sizeof(background));

In the production code one needs to add check if buffer.size() is divisible by 4 and add appropriate handling for that.

Kirill V. Lyadvinsky 2010-07-19 07:14:39

ansaurus

tags:

views:

answers:

How to fill memory fast with a `int32_t` value?

related questions