I am performing a scattered read of 8-bit data from a file (de-interleaving a 64-channel wave file) and then combining the samples into a single stream of bytes. The problem I'm having is with my reconstruction of the data to write out.

Basically I'm reading 16 bytes, building them into a single __m128i variable and then using _mm_stream_ps to write the value back out to memory. However, I'm getting some odd performance results.

In my first scheme I use the _mm_set_epi8 intrinsic to set my __m128i as follows:

    const __m128i packedSamples = _mm_set_epi8( sample15, sample14, sample13, sample12, sample11, sample10, sample9, sample8,
                                                sample7,    sample6, sample5, sample4, sample3, sample2, sample1, sample0 );

Basically I leave it all up to the compiler to decide how to optimise it for best performance. This gives the WORST performance: my test runs in ~0.195 seconds.

Second, I tried merging down by using four _mm_set_epi32 calls and then packing the results down:

    const __m128i samples0      = _mm_set_epi32( sample3, sample2, sample1, sample0 );
    const __m128i samples1      = _mm_set_epi32( sample7, sample6, sample5, sample4 );
    const __m128i samples2      = _mm_set_epi32( sample11, sample10, sample9, sample8 );
    const __m128i samples3      = _mm_set_epi32( sample15, sample14, sample13, sample12 );

    const __m128i packedSamples0    = _mm_packs_epi32( samples0, samples1 );
    const __m128i packedSamples1    = _mm_packs_epi32( samples2, samples3 );
    const __m128i packedSamples     = _mm_packus_epi16( packedSamples0, packedSamples1 );

This does improve performance somewhat; my test now runs in ~0.15 seconds. It seems counter-intuitive that performance would improve this way, as I assumed this is exactly what _mm_set_epi8 is doing anyway ...

My final attempt was to use a bit of code I have for making FourCCs the old-fashioned way (with shifts and ORs) and then putting the four results into an __m128i using a single _mm_set_epi32.

    const GCui32 samples0       = MakeFourCC( sample0, sample1, sample2, sample3 );
    const GCui32 samples1       = MakeFourCC( sample4, sample5, sample6, sample7 );
    const GCui32 samples2       = MakeFourCC( sample8, sample9, sample10, sample11 );
    const GCui32 samples3       = MakeFourCC( sample12, sample13, sample14, sample15 );
    const __m128i packedSamples = _mm_set_epi32( samples3, samples2, samples1, samples0 );
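
For reference, MakeFourCC (and the Build32 helper in the later edits) is just a shift-and-or byte packer. A minimal sketch of the sort of thing I mean - the GCui8/GCui32 typedefs here are only stand-ins to keep the snippet self-contained, and the byte order simply mirrors the _mm_set_epi8 call above:

    // Sketch of a shift-and-or packer along the lines of MakeFourCC / Build32.
    // GCui8/GCui32 are assumed to be plain 8-bit/32-bit unsigned types.
    typedef unsigned char GCui8;
    typedef unsigned int  GCui32;

    inline GCui32 MakeFourCC( const GCui8 b0, const GCui8 b1, const GCui8 b2, const GCui8 b3 )
    {
        // b0 ends up in the least-significant byte, which matches the
        // _mm_set_epi8( sample15, ..., sample0 ) ordering used earlier.
        return (GCui32)b0 | ( (GCui32)b1 << 8 ) | ( (GCui32)b2 << 16 ) | ( (GCui32)b3 << 24 );
    }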

This gives even BETTER performance, taking ~0.135 seconds to run my test. I'm really starting to get confused.

So I tried a simple read-byte/write-byte system, and that is ever-so-slightly faster than even the last method.
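
For what it's worth, the byte-by-byte version is nothing clever - roughly the shape below (an illustration rather than my exact code; pWriteBytes is just a plain byte output pointer):

    // Roughly what the byte-by-byte version looks like: a straight copy, no SSE.
    // pWriteBytes is an illustrative GCui8* output pointer.
    for( int group = 0; group < 4; ++group )
    {
        *pWriteBytes++  = *(pSamples + channelStep0);
        *pWriteBytes++  = *(pSamples + channelStep1);
        *pWriteBytes++  = *(pSamples + channelStep2);
        *pWriteBytes++  = *(pSamples + channelStep3);
        pSamples        += channelStep4;
    }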

So what is going on? This all seems counter-intuitive to me.

I've considered the idea that the delays are occurring on the _mm_stream_ps because I'm supplying data too quickly, but then I would expect to get exactly the same results whatever I do. Is it possible that the first two methods mean the 16 loads can't get distributed through the loop to hide latency? If so, why is that? Surely an intrinsic allows the compiler to make optimisations as and where it pleases ... I thought that was the whole point ... Also, surely performing 16 reads and 16 writes will be much slower than 16 reads and 1 write with a bunch of SSE juggling instructions ... After all, it's the reads and writes that are the slow bit!

Any ideas about what's going on would be much appreciated! :D

Edit: Further to the comment below, I stopped pre-loading the bytes as constants and changed it to this:

    const __m128i samples0      = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
    pSamples    += channelStep4;
    const __m128i samples1      = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
    pSamples    += channelStep4;
    const __m128i samples2      = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
    pSamples    += channelStep4;
    const __m128i samples3      = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
    pSamples    += channelStep4;

    const __m128i packedSamples0    = _mm_packs_epi32( samples0, samples1 );
    const __m128i packedSamples1    = _mm_packs_epi32( samples2, samples3 );
    const __m128i packedSamples     = _mm_packus_epi16( packedSamples0, packedSamples1 );

and this improved performance to ~0.143 seconds. Still not as good as the straight C implementation ...

Edit again: The best performance I'm getting thus far is:

    // Load the samples.
    const GCui8 sample0     = *(pSamples + channelStep0);
    const GCui8 sample1     = *(pSamples + channelStep1);
    const GCui8 sample2     = *(pSamples + channelStep2);
    const GCui8 sample3     = *(pSamples + channelStep3);

    const GCui32 samples0   = Build32( sample0, sample1, sample2, sample3 );
    pSamples += channelStep4;

    const GCui8 sample4     = *(pSamples + channelStep0);
    const GCui8 sample5     = *(pSamples + channelStep1);
    const GCui8 sample6     = *(pSamples + channelStep2);
    const GCui8 sample7     = *(pSamples + channelStep3);

    const GCui32 samples1   = Build32( sample4, sample5, sample6, sample7 );
    pSamples += channelStep4;

    // Load the samples.
    const GCui8 sample8     = *(pSamples + channelStep0);
    const GCui8 sample9     = *(pSamples + channelStep1);
    const GCui8 sample10    = *(pSamples + channelStep2);
    const GCui8 sample11    = *(pSamples + channelStep3);

    const GCui32 samples2       = Build32( sample8, sample9, sample10, sample11 );
    pSamples += channelStep4;

    const GCui8 sample12    = *(pSamples + channelStep0);
    const GCui8 sample13    = *(pSamples + channelStep1);
    const GCui8 sample14    = *(pSamples + channelStep2);
    const GCui8 sample15    = *(pSamples + channelStep3);

    const GCui32 samples3   = Build32( sample12, sample13, sample14, sample15 );
    pSamples += channelStep4;

    const __m128i packedSamples = _mm_set_epi32( samples3, samples2, samples1, samples0 );

    _mm_stream_ps( pWrite + 0,  *(__m128*)&packedSamples ); 

This gives me processing in ~0.095 seconds, which is considerably better. I don't appear to be able to get close to that with SSE though ... I'm still confused by that, but ... ho hum.

+2  A: 

Perhaps the compiler is trying to put all the arguments to the intrinsic into registers at once. You don't want to access that many variables at once without organizing them.

Rather than declare a separate identifier for each sample, try putting them into a char[16]. The compiler will promote the 16 values to registers as it sees fit, as long as you don't take the address of anything within the array. You can add an __aligned__ tag (or whatever VC++ uses) and maybe avoid the intrinsic altogether. Otherwise, calling the intrinsic with ( sample[15], sample[14], sample[13] … sample[0] ) should make the compiler's job easier or at least do no harm.
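
For instance, something like this (only a sketch: __declspec(align(16)) is the VC++ spelling of the alignment attribute, and the gather loop below reuses the pSamples/channelStep names from the question as stand-ins for however the bytes are actually fetched):

    // Sketch: fill a 16-byte aligned array instead of 16 separately named
    // locals, then get it into an __m128i in one go.
    __declspec(align(16)) unsigned char samples[16];
    for( int j = 0; j < 4; ++j )
    {
        samples[j*4 + 0] = *(pSamples + channelStep0);
        samples[j*4 + 1] = *(pSamples + channelStep1);
        samples[j*4 + 2] = *(pSamples + channelStep2);
        samples[j*4 + 3] = *(pSamples + channelStep3);
        pSamples        += channelStep4;
    }
    const __m128i packedSamples = _mm_load_si128( (const __m128i*)samples );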


Edit: I'm pretty sure you're fighting a register spill but that suggestion will probably just store the bytes individually, which isn't what you want. I think my advice is to interleave your final attempt (using MakeFourCC) with the read operations, to make sure it's scheduled correctly and with no round-trips to the stack. Of course, inspection of object code is the best way to ensure that.

Essentially, you are streaming data into the register file and then streaming it back out. You don't want to overload it before it's time to flush the data.

Potatoswatter
Thing is, by doing this I may as well write them all straight to memory. It is giving me ideas though ... I'm beginning to think I could get better performance by writing some simple assembler. I just wanted to avoid an assembler block for 64-bit reasons ... I really hoped the compiler would take care of this for me ... my mistake ;)
Goz
That's why I made the edit… the real key point is to ensure that the bytes are assembled as they arrive. Then you have at most three 4-byte variables and two 2-byte ones (since x86 can address high/low bytes already) for five registers max before you call `_mm_set_epi32`.
Potatoswatter
I just tried exactly what you say in your edit. Suddenly execution time is down to ~0.095 seconds. I thought the compiler would perform this sort of re-ordering, but it seems not ... ouch. (This is for the MakeFourCC code; using the 2nd code attempt I'm still back to ~0.143 seconds.)
Goz
A: 

Using intrinsics breaks compiler optimisations!

The whole point of the intrinsic functions is to insert opcodes the compiler doesn't know about into the stream of opcodes the compiler does know about and has generated. Unless the compiler is given some meta data about the opcode and how it affects the registers and memory, the compiler can't assume that any data is preserved after executing the intrinsic. This really hurts the optimising part of the compiler - it can't reorder instructions around the intrinsic, it can't assume registers are unaffected and so on.

I think the best way to optimise this is to look at the bigger picture - you need to consider the whole process from reading the source data to writing the final output. Micro optimisations rarely give big results, unless you're doing something really badly to start with.

Perhaps, if you detail the required input and output, someone here could suggest an optimal method to handle it.

Skizz
I'm pretty sure that intrinsics DON'T break optimisation. That's the whole point of them. Using an __asm block DOES break optimisation, which is why Microsoft offered intrinsics in the first place. This link appears to agree with me ... http://blogs.msdn.com/vcblog/archive/2007/10/18/new-intrinsic-support-in-visual-studio-2008.aspx
Goz
Skizz, have you ever written SIMD code? I personally do prefer to avoid intrinsics, but the alternatives are even less portable and riskier.
Potatoswatter
@Goz: I'll edit my post, but, what I tried to say was that if the compiler does not know what the intrinsic does it would be the same as an __asm block. The intrinsics in DevStudio may well be known to the compiler and so the compiler can optimise around them. If the intrinsic function is just a wrapper around an __asm block then the compiler is stuck unable to optimise well. If it's a call to a library then there's little point using it for optimised code.
Skizz
@Potatoswatter: Yes, I have written SIMD code (even in some of my answers to SO questions). Some say using intrinsics is a bit more portable than using straight asm but I think that if you're using intrinsics you're already aiming at a sub-set of available CPUs so just use the asm. OK, asm uses terse mnemonics but you need to know how the instructions work to take advantage of them so you've done the hard part.
Skizz
Perhaps that's true of VS (I doubt it), but generally, the compiler is as aware of the semantics of the intrinsic function and the IR of the functionality inside it as for any other function. Asm blocks are different. You CAN generally expect an intrinsic to be scheduled, and that is what makes them desirable.
Potatoswatter
+2  A: 

VS is notoriously bad at optimizing intrinsics, especially at moving data to and from SSE registers. The intrinsics themselves are used pretty well, however ...

What you are seeing is that it is trying to fill the SSE register with this monster:

00AA100C  movzx       ecx,byte ptr [esp+0Fh]  
00AA1011  movzx       edx,byte ptr [esp+0Fh]  
00AA1016  movzx       eax,byte ptr [esp+0Fh]  
00AA101B  movd        xmm0,eax  
00AA101F  movzx       eax,byte ptr [esp+0Fh]  
00AA1024  movd        xmm2,edx  
00AA1028  movzx       edx,byte ptr [esp+0Fh]  
00AA102D  movd        xmm1,ecx  
00AA1031  movzx       ecx,byte ptr [esp+0Fh]  
00AA1036  movd        xmm4,ecx  
00AA103A  movzx       ecx,byte ptr [esp+0Fh]  
00AA103F  movd        xmm5,edx  
00AA1043  movzx       edx,byte ptr [esp+0Fh]  
00AA1048  movd        xmm3,eax  
00AA104C  movzx       eax,byte ptr [esp+0Fh]  
00AA1051  movdqa      xmmword ptr [esp+60h],xmm0  
00AA1057  movd        xmm0,edx  
00AA105B  movzx       edx,byte ptr [esp+0Fh]  
00AA1060  movd        xmm6,eax  
00AA1064  movzx       eax,byte ptr [esp+0Fh]  
00AA1069  movd        xmm7,ecx  
00AA106D  movzx       ecx,byte ptr [esp+0Fh]  
00AA1072  movdqa      xmmword ptr [esp+20h],xmm4  
00AA1078  movdqa      xmmword ptr [esp+80h],xmm0  
00AA1081  movd        xmm4,ecx  
00AA1085  movzx       ecx,byte ptr [esp+0Fh]  
00AA108A  movdqa      xmmword ptr [esp+70h],xmm2  
00AA1090  movd        xmm0,eax  
00AA1094  movzx       eax,byte ptr [esp+0Fh]  
00AA1099  movdqa      xmmword ptr [esp+10h],xmm4  
00AA109F  movdqa      xmmword ptr [esp+50h],xmm6  
00AA10A5  movd        xmm2,edx  
00AA10A9  movzx       edx,byte ptr [esp+0Fh]  
00AA10AE  movd        xmm4,eax  
00AA10B2  movzx       eax,byte ptr [esp+0Fh]  
00AA10B7  movd        xmm6,edx  
00AA10BB  punpcklbw   xmm0,xmm1  
00AA10BF  punpcklbw   xmm2,xmm3  
00AA10C3  movdqa      xmm3,xmmword ptr [esp+80h]  
00AA10CC  movdqa      xmmword ptr [esp+40h],xmm4  
00AA10D2  movd        xmm4,ecx  
00AA10D6  movdqa      xmmword ptr [esp+30h],xmm6  
00AA10DC  movdqa      xmm1,xmmword ptr [esp+30h]  
00AA10E2  movd        xmm6,eax  
00AA10E6  punpcklbw   xmm4,xmm5  
00AA10EA  punpcklbw   xmm4,xmm0  
00AA10EE  movdqa      xmm0,xmmword ptr [esp+50h]  
00AA10F4  punpcklbw   xmm1,xmm0  
00AA10F8  movdqa      xmm0,xmmword ptr [esp+70h]  
00AA10FE  punpcklbw   xmm6,xmm7  
00AA1102  punpcklbw   xmm6,xmm2  
00AA1106  movdqa      xmm2,xmmword ptr [esp+10h]  
00AA110C  punpcklbw   xmm2,xmm0  
00AA1110  movdqa      xmm0,xmmword ptr [esp+20h]  
00AA1116  punpcklbw   xmm1,xmm2  
00AA111A  movdqa      xmm2,xmmword ptr [esp+40h]  
00AA1120  punpcklbw   xmm2,xmm0  
00AA1124  movdqa      xmm0,xmmword ptr [esp+60h]  
00AA112A  punpcklbw   xmm3,xmm0  
00AA112E  punpcklbw   xmm2,xmm3  
00AA1132  punpcklbw   xmm6,xmm4  
00AA1136  punpcklbw   xmm1,xmm2  
00AA113A  punpcklbw   xmm6,xmm1  

This works much better and should easily be faster:

// Note: arr[0] ends up in the lowest byte of the register, so the array is
// listed sample0 first - i.e. the reverse of the _mm_set_epi8 argument order.
__declspec(align(16)) BYTE arr[16] = { sample0,  sample1,  sample2,  sample3,  sample4,  sample5,  sample6,  sample7,
                                       sample8,  sample9,  sample10, sample11, sample12, sample13, sample14, sample15 };

__m128i packedSamples = _mm_load_si128( (__m128i*)arr );

I built my own test-bed:

#include <windows.h>     // BYTE, ULONG_PTR, QueryPerformanceCounter/Frequency
#include <emmintrin.h>   // _mm_load_si128, _mm_stream_si128
#include <stdio.h>       // printf

void    f()
{
    const int steps = 1000000;
    BYTE* pDest = new BYTE[steps*16+16];
    pDest += 16 - ((ULONG_PTR)pDest % 16);
    BYTE* pSrc = new BYTE[steps*16*16];

    const int channelStep0 = 0;
    const int channelStep1 = 1;
    const int channelStep2 = 2;
    const int channelStep3 = 3;
    const int channelStep4 = 16;

    __int64 freq;
    QueryPerformanceFrequency( (LARGE_INTEGER*)&freq );
    __int64 start = 0, end;
    QueryPerformanceCounter( (LARGE_INTEGER*)&start );

    for( int step = 0; step < steps; ++step )
    {
        __declspec(align(16)) BYTE arr[16];
        for( int j = 0; j < 4; ++j )
        {
            //for( int i = 0; i < 4; ++i )
            {
                arr[0+j*4] = *(pSrc + channelStep0);
                arr[1+j*4] = *(pSrc + channelStep1);
                arr[2+j*4] = *(pSrc + channelStep2);
                arr[3+j*4] = *(pSrc + channelStep3);
            }
            pSrc += channelStep4;
        }

#if test1
// test 1 with C
        for( int i = 0; i < 16; ++i )
        {
            *(pDest + step * 16 + i) = arr[i];
        }
#else
// test 2 with SSE load/store    
        __m128i packedSamples = _mm_load_si128( (__m128i*)arr );
        _mm_stream_si128( ((__m128i*)pDest) + step, packedSamples );
#endif
    }

    QueryPerformanceCounter( (LARGE_INTEGER*)&end );

    printf( "%I64d", (end - start) * 1000 / freq );

}

For me, test 2 is faster than test 1.

Am I doing something wrong? Is this not the code you are using? What am I missing? Is this just me?

Christopher
Yup, that's definitely the fastest SSE-based implementation so far (~0.124 seconds). But if you check my last edit you'll see that avoiding SSE completely provided me with a speed boost that beats even that hands down. Thanks a lot though, it's still very useful. There is a reason I prefer to just write the bloody things in assembler ;)
Goz
Actually I lie ... I implemented that slightly wrongly (should have unit tested the result). Strangely it doesn't provide any speed boost ... it generates much the same code ...
Goz
And I tried a slightly different implementation whereby each load is `*(pSamples += channelStep)` (except the first one, obviously), and now I'm getting ~0.13 seconds ... which is good but still not great ...
Goz