I am performing a scattered read of 8-bit data from a file (de-interleaving a 64-channel wave file) and then combining the samples into a single stream of bytes. The problem I'm having is with my reconstruction of the data to write out.

Basically I'm reading 16 bytes, building them into a single __m128i variable and then using _mm_stream_ps to write the value back out to memory. However, I'm getting some odd performance results.

In my first scheme I use the _mm_set_epi8 intrinsic to set my __m128i as follows:

    const __m128i packedSamples = _mm_set_epi8( sample15, sample14, sample13, sample12, sample11, sample10, sample9, sample8,
                                                sample7,    sample6, sample5, sample4, sample3, sample2, sample1, sample0 );

Basically I leave it all up to the compiler to decide how to optimise it for best performance. This gives the WORST performance: my test runs in ~0.195 seconds.

Second, I tried merging down by using four _mm_set_epi32 calls and then packing the results down:

    const __m128i samples0      = _mm_set_epi32( sample3, sample2, sample1, sample0 );
    const __m128i samples1      = _mm_set_epi32( sample7, sample6, sample5, sample4 );
    const __m128i samples2      = _mm_set_epi32( sample11, sample10, sample9, sample8 );
    const __m128i samples3      = _mm_set_epi32( sample15, sample14, sample13, sample12 );

    const __m128i packedSamples0    = _mm_packs_epi32( samples0, samples1 );
    const __m128i packedSamples1    = _mm_packs_epi32( samples2, samples3 );
    const __m128i packedSamples     = _mm_packus_epi16( packedSamples0, packedSamples1 );

This does improve performance somewhat; my test now runs in ~0.15 seconds. It seems counter-intuitive that performance would improve this way, as I assumed this is exactly what _mm_set_epi8 is doing anyway ...

My final attempt was to use a bit of code I have for making FourCCs the old-fashioned way (with shifts and ORs) and then putting the four results into an __m128i using a single _mm_set_epi32.

    const GCui32 samples0       = MakeFourCC( sample0, sample1, sample2, sample3 );
    const GCui32 samples1       = MakeFourCC( sample4, sample5, sample6, sample7 );
    const GCui32 samples2       = MakeFourCC( sample8, sample9, sample10, sample11 );
    const GCui32 samples3       = MakeFourCC( sample12, sample13, sample14, sample15 );
    const __m128i packedSamples = _mm_set_epi32( samples3, samples2, samples1, samples0 );
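
For reference, MakeFourCC (and the Build32 helper in the later edits) is just a shift-and-or byte packer. A minimal sketch of the sort of thing I mean - the GCui8/GCui32 typedefs here are only stand-ins to keep the snippet self-contained, and the byte order simply mirrors the _mm_set_epi8 call above:

    // Sketch of a shift-and-or packer along the lines of MakeFourCC / Build32.
    // GCui8/GCui32 are assumed to be plain 8-bit/32-bit unsigned types.
    typedef unsigned char GCui8;
    typedef unsigned int  GCui32;

    inline GCui32 MakeFourCC( const GCui8 b0, const GCui8 b1, const GCui8 b2, const GCui8 b3 )
    {
        // b0 ends up in the least-significant byte, which matches the
        // _mm_set_epi8( sample15, ..., sample0 ) ordering used earlier.
        return (GCui32)b0 | ( (GCui32)b1 << 8 ) | ( (GCui32)b2 << 16 ) | ( (GCui32)b3 << 24 );
    }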

This gives even BETTER performance, taking ~0.135 seconds to run my test. I'm really starting to get confused.

So I tried a simple read-byte/write-byte system, and that is ever-so-slightly faster than even the last method.
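
For what it's worth, the byte-by-byte version is nothing clever - roughly the shape below (an illustration rather than my exact code; pWriteBytes is just a plain byte output pointer):

    // Roughly what the byte-by-byte version looks like: a straight copy, no SSE.
    // pWriteBytes is an illustrative GCui8* output pointer.
    for( int group = 0; group < 4; ++group )
    {
        *pWriteBytes++  = *(pSamples + channelStep0);
        *pWriteBytes++  = *(pSamples + channelStep1);
        *pWriteBytes++  = *(pSamples + channelStep2);
        *pWriteBytes++  = *(pSamples + channelStep3);
        pSamples        += channelStep4;
    }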

So what is going on? This all seems counter-intuitive to me.

I've considered the idea that the delays are occurring on the _mm_stream_ps because I'm supplying data too quickly, but then I would expect to get exactly the same results whatever I do. Is it possible that the first two methods mean the 16 loads can't get distributed through the loop to hide latency? If so, why is that? Surely an intrinsic allows the compiler to make optimisations as and where it pleases ... I thought that was the whole point ... Also, surely performing 16 reads and 16 writes will be much slower than 16 reads and 1 write with a bunch of SSE juggling instructions ... After all, it's the reads and writes that are the slow bit!

Any ideas about what's going on would be much appreciated! :D

Edit: Further to the comment below, I stopped pre-loading the bytes as constants and changed it to this:

    const __m128i samples0      = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
    pSamples    += channelStep4;
    const __m128i samples1      = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
    pSamples    += channelStep4;
    const __m128i samples2      = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
    pSamples    += channelStep4;
    const __m128i samples3      = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
    pSamples    += channelStep4;

    const __m128i packedSamples0    = _mm_packs_epi32( samples0, samples1 );
    const __m128i packedSamples1    = _mm_packs_epi32( samples2, samples3 );
    const __m128i packedSamples     = _mm_packus_epi16( packedSamples0, packedSamples1 );

and this improved performance to ~0.143 seconds. Still not as good as the straight C implementation ...

Edit again: The best performance I'm getting thus far is:

    // Load the samples.
    const GCui8 sample0     = *(pSamples + channelStep0);
    const GCui8 sample1     = *(pSamples + channelStep1);
    const GCui8 sample2     = *(pSamples + channelStep2);
    const GCui8 sample3     = *(pSamples + channelStep3);

    const GCui32 samples0   = Build32( sample0, sample1, sample2, sample3 );
    pSamples += channelStep4;

    const GCui8 sample4     = *(pSamples + channelStep0);
    const GCui8 sample5     = *(pSamples + channelStep1);
    const GCui8 sample6     = *(pSamples + channelStep2);
    const GCui8 sample7     = *(pSamples + channelStep3);

    const GCui32 samples1   = Build32( sample4, sample5, sample6, sample7 );
    pSamples += channelStep4;

    // Load the samples.
    const GCui8 sample8     = *(pSamples + channelStep0);
    const GCui8 sample9     = *(pSamples + channelStep1);
    const GCui8 sample10    = *(pSamples + channelStep2);
    const GCui8 sample11    = *(pSamples + channelStep3);

    const GCui32 samples2       = Build32( sample8, sample9, sample10, sample11 );
    pSamples += channelStep4;

    const GCui8 sample12    = *(pSamples + channelStep0);
    const GCui8 sample13    = *(pSamples + channelStep1);
    const GCui8 sample14    = *(pSamples + channelStep2);
    const GCui8 sample15    = *(pSamples + channelStep3);

    const GCui32 samples3   = Build32( sample12, sample13, sample14, sample15 );
    pSamples += channelStep4;

    const __m128i packedSamples = _mm_set_epi32( samples3, samples2, samples1, samples0 );

    _mm_stream_ps( pWrite + 0,  *(__m128*)&packedSamples ); 

This gives me processing in ~0.095 seconds, which is considerably better. I don't appear to be able to get close to that with SSE though ... I'm still confused by that, but ... ho hum.

+2  A: 

Perhaps the compiler is trying to put all the arguments to the intrinsic into registers at once. You don't want to access that many variables at once without organizing them.

Rather than declare a separate identifier for each sample, try putting them into a char[16]. The compiler will promote the 16 values to registers as it sees fit, as long as you don't take the address of anything within the array. You can add an __aligned__ tag (or whatever VC++ uses) and maybe avoid the intrinsic altogether. Otherwise, calling the intrinsic with ( sample[15], sample[14], sample[13] … sample[0] ) should make the compiler's job easier or at least do no harm.
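
For instance, something like this (only a sketch: __declspec(align(16)) is the VC++ spelling of the alignment attribute, and the gather loop below reuses the pSamples/channelStep names from the question as stand-ins for however the bytes are actually fetched):

    // Sketch: fill a 16-byte aligned array instead of 16 separately named
    // locals, then get it into an __m128i in one go.
    __declspec(align(16)) unsigned char samples[16];
    for( int j = 0; j < 4; ++j )
    {
        samples[j*4 + 0] = *(pSamples + channelStep0);
        samples[j*4 + 1] = *(pSamples + channelStep1);
        samples[j*4 + 2] = *(pSamples + channelStep2);
        samples[j*4 + 3] = *(pSamples + channelStep3);
        pSamples        += channelStep4;
    }
    const __m128i packedSamples = _mm_load_si128( (const __m128i*)samples );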


Edit: I'm pretty sure you're fighting a register spill but that suggestion will probably just store the bytes individually, which isn't what you want. I think my advice is to interleave your final attempt (using MakeFourCC) with the read operations, to make sure it's scheduled correctly and with no round-trips to the stack. Of course, inspection of object code is the best way to ensure that.

Essentially, you are streaming data into the register file and then streaming it back out. You don't want to overload it before it's time to flush the data.

Potatoswatter
Thing is, by doing this I may as well write them all straight to memory. It is giving me ideas though ... I'm beginning to think I could get better performance by writing some simple assembler. I just wanted to avoid an assembler block for 64-bit reasons ... I really hoped the compiler would take care of this for me ... my mistake ;)
Goz
That's why I made the edit… the real key point is to ensure that the bytes are assembled as they arrive. Then you have at most three 4-byte variables and two 2-byte ones (since x86 can address high/low bytes already) for five registers max before you call `_mm_set_epi32`.
Potatoswatter
I just tried exactly what you say in your edit. Suddenly execution time is down to ~0.095 seconds. I thought the compiler would perform this sort of re-ordering, but it seems not ... ouch. (This is for the MakeFourCC code; using the 2nd code attempt I'm still back to ~0.143 seconds.)
Goz
A: 

Using intrinsics breaks compiler optimisations!

The whole point of the intrinsic functions is to insert opcodes the compiler doesn't know about into the stream of opcodes the compiler does know about and has generated. Unless the compiler is given some meta data about the opcode and how it affects the registers and memory, the compiler can't assume that any data is preserved after executing the intrinsic. This really hurts the optimising part of the compiler - it can't reorder instructions around the intrinsic, it can't assume registers are unaffected and so on.

I think the best way to optimise this is to look at the bigger picture - you need to consider the whole process from reading the source data to writing the final output. Micro optimisations rarely give big results, unless you're doing something really badly to start with.

Perhaps, if you detail the required input and output, someone here could suggest an optimal method to handle it.

Skizz
I'm pretty sure that intrinsics DON'T break optimisation. That's the whole point of them. Using an __asm block DOES break optimisation, which is why Microsoft offered intrinsics in the first place. This link appears to agree with me ... http://blogs.msdn.com/vcblog/archive/2007/10/18/new-intrinsic-support-in-visual-studio-2008.aspx
Goz
Skizz, have you ever written SIMD code? I personally do prefer to avoid intrinsics, but the alternatives are even less portable and riskier.
Potatoswatter
@Goz: I'll edit my post, but, what I tried to say was that if the compiler does not know what the intrinsic does it would be the same as an __asm block. The intrinsics in DevStudio may well be known to the compiler and so the compiler can optimise around them. If the intrinsic function is just a wrapper around an __asm block then the compiler is stuck unable to optimise well. If it's a call to a library then there's little point using it for optimised code.
Skizz
@Potatoswatter: Yes, I have written SIMD code (even in some of my answers to SO questions). Some say using intrinsics is a bit more portable than using straight asm but I think that if you're using intrinsics you're already aiming at a sub-set of available CPUs so just use the asm. OK, asm uses terse mnemonics but you need to know how the instructions work to take advantage of them so you've done the hard part.
Skizz
Perhaps that's true of VS (I doubt it), but generally, the compiler is as aware of the semantics of the intrinsic function and the IR of the functionality inside it as for any other function. Asm blocks are different. You CAN generally expect an intrinsic to be scheduled, and that is what makes them desirable.
Potatoswatter
+2  A: 

VS is notoriously bad at optimizing intrinsics, especially at moving data to and from SSE registers. The intrinsics themselves are used pretty well, however ...

What you are seeing is that it is trying to fill the SSE register with this monster:

00AA100C  movzx       ecx,byte ptr [esp+0Fh]  
00AA1011  movzx       edx,byte ptr [esp+0Fh]  
00AA1016  movzx       eax,byte ptr [esp+0Fh]  
00AA101B  movd        xmm0,eax  
00AA101F  movzx       eax,byte ptr [esp+0Fh]  
00AA1024  movd        xmm2,edx  
00AA1028  movzx       edx,byte ptr [esp+0Fh]  
00AA102D  movd        xmm1,ecx  
00AA1031  movzx       ecx,byte ptr [esp+0Fh]  
00AA1036  movd        xmm4,ecx  
00AA103A  movzx       ecx,byte ptr [esp+0Fh]  
00AA103F  movd        xmm5,edx  
00AA1043  movzx       edx,byte ptr [esp+0Fh]  
00AA1048  movd        xmm3,eax  
00AA104C  movzx       eax,byte ptr [esp+0Fh]  
00AA1051  movdqa      xmmword ptr [esp+60h],xmm0  
00AA1057  movd        xmm0,edx  
00AA105B  movzx       edx,byte ptr [esp+0Fh]  
00AA1060  movd        xmm6,eax  
00AA1064  movzx       eax,byte ptr [esp+0Fh]  
00AA1069  movd        xmm7,ecx  
00AA106D  movzx       ecx,byte ptr [esp+0Fh]  
00AA1072  movdqa      xmmword ptr [esp+20h],xmm4  
00AA1078  movdqa      xmmword ptr [esp+80h],xmm0  
00AA1081  movd        xmm4,ecx  
00AA1085  movzx       ecx,byte ptr [esp+0Fh]  
00AA108A  movdqa      xmmword ptr [esp+70h],xmm2  
00AA1090  movd        xmm0,eax  
00AA1094  movzx       eax,byte ptr [esp+0Fh]  
00AA1099  movdqa      xmmword ptr [esp+10h],xmm4  
00AA109F  movdqa      xmmword ptr [esp+50h],xmm6  
00AA10A5  movd        xmm2,edx  
00AA10A9  movzx       edx,byte ptr [esp+0Fh]  
00AA10AE  movd        xmm4,eax  
00AA10B2  movzx       eax,byte ptr [esp+0Fh]  
00AA10B7  movd        xmm6,edx  
00AA10BB  punpcklbw   xmm0,xmm1  
00AA10BF  punpcklbw   xmm2,xmm3  
00AA10C3  movdqa      xmm3,xmmword ptr [esp+80h]  
00AA10CC  movdqa      xmmword ptr [esp+40h],xmm4  
00AA10D2  movd        xmm4,ecx  
00AA10D6  movdqa      xmmword ptr [esp+30h],xmm6  
00AA10DC  movdqa      xmm1,xmmword ptr [esp+30h]  
00AA10E2  movd        xmm6,eax  
00AA10E6  punpcklbw   xmm4,xmm5  
00AA10EA  punpcklbw   xmm4,xmm0  
00AA10EE  movdqa      xmm0,xmmword ptr [esp+50h]  
00AA10F4  punpcklbw   xmm1,xmm0  
00AA10F8  movdqa      xmm0,xmmword ptr [esp+70h]  
00AA10FE  punpcklbw   xmm6,xmm7  
00AA1102  punpcklbw   xmm6,xmm2  
00AA1106  movdqa      xmm2,xmmword ptr [esp+10h]  
00AA110C  punpcklbw   xmm2,xmm0  
00AA1110  movdqa      xmm0,xmmword ptr [esp+20h]  
00AA1116  punpcklbw   xmm1,xmm2  
00AA111A  movdqa      xmm2,xmmword ptr [esp+40h]  
00AA1120  punpcklbw   xmm2,xmm0  
00AA1124  movdqa      xmm0,xmmword ptr [esp+60h]  
00AA112A  punpcklbw   xmm3,xmm0  
00AA112E  punpcklbw   xmm2,xmm3  
00AA1132  punpcklbw   xmm6,xmm4  
00AA1136  punpcklbw   xmm1,xmm2  
00AA113A  punpcklbw   xmm6,xmm1  

This works much better and should easily be faster:

// Note: arr[0] ends up in the lowest byte of the register, so the array is
// listed sample0 first - i.e. the reverse of the _mm_set_epi8 argument order.
__declspec(align(16)) BYTE arr[16] = { sample0,  sample1,  sample2,  sample3,  sample4,  sample5,  sample6,  sample7,
                                       sample8,  sample9,  sample10, sample11, sample12, sample13, sample14, sample15 };

__m128i packedSamples = _mm_load_si128( (__m128i*)arr );

I built my own test-bed:

#include <windows.h>     // BYTE, ULONG_PTR, QueryPerformanceCounter/Frequency
#include <emmintrin.h>   // _mm_load_si128, _mm_stream_si128
#include <stdio.h>       // printf

void    f()
{
    const int steps = 1000000;
    BYTE* pDest = new BYTE[steps*16+16];
    pDest += 16 - ((ULONG_PTR)pDest % 16);
    BYTE* pSrc = new BYTE[steps*16*16];

    const int channelStep0 = 0;
    const int channelStep1 = 1;
    const int channelStep2 = 2;
    const int channelStep3 = 3;
    const int channelStep4 = 16;

    __int64 freq;
    QueryPerformanceFrequency( (LARGE_INTEGER*)&freq );
    __int64 start = 0, end;
    QueryPerformanceCounter( (LARGE_INTEGER*)&start );

    for( int step = 0; step < steps; ++step )
    {
        __declspec(align(16)) BYTE arr[16];
        for( int j = 0; j < 4; ++j )
        {
            //for( int i = 0; i < 4; ++i )
            {
                arr[0+j*4] = *(pSrc + channelStep0);
                arr[1+j*4] = *(pSrc + channelStep1);
                arr[2+j*4] = *(pSrc + channelStep2);
                arr[3+j*4] = *(pSrc + channelStep3);
            }
            pSrc += channelStep4;
        }

#if test1
// test 1 with C
        for( int i = 0; i < 16; ++i )
        {
            *(pDest + step * 16 + i) = arr[i];
        }
#else
// test 2 with SSE load/store    
        __m128i packedSamples = _mm_load_si128( (__m128i*)arr );
        _mm_stream_si128( ((__m128i*)pDest) + step, packedSamples );
#endif
    }

    QueryPerformanceCounter( (LARGE_INTEGER*)&end );

    printf( "%I64d", (end - start) * 1000 / freq );

}

For me, test 2 is faster than test 1.

Am I doing something wrong? Is this not the code you are using? What am I missing? Is this just me?

Christopher
Yup, that's definitely the fastest SSE-based implementation so far (~0.124 seconds). But if you check my last edit you'll see that avoiding SSE completely provided me with a speed boost that beats even that hands down. Thanks a lot though, it's still very useful. There is a reason I prefer to just write the bloody things in assembler ;)
Goz
Actually I lie ... I implemented that slightly wrongly (should have unit tested the result). Strangely it doesn't provide any speed boost ... it generates much the same code ...
Goz
And I tried a slightly different implementation whereby each load is `*(pSamples += channelStep)` (except the first one, obviously), and now I'm getting ~0.13 seconds ... which is good but still not great ...
Goz