I'm trying to figure out how best to precompute some sine and cosine values, store them in 16-byte-aligned blocks, and then use them later in SSE calculations.

At the beginning of my program, I create an object with member:

static __m128 *m_sincos;

Then I initialize that member in the constructor:

m_sincos = (__m128*) _aligned_malloc(Bins*sizeof(__m128), 16);
for (int t=0; t<Bins; t++)
  m_sincos[t] = _mm_set_ps(cos(t), sin(t), sin(t), cos(t));



When I go to use m_sincos, I run into three problems:
-The data does not seem to be aligned

movaps xmm0, m_sincos[t] //crashes
movups xmm0, m_sincos[t] //does not crash

-The variables do not seem to be correct

movaps result, xmm0 // returns values that are not what is in m_sincos[t]
//Although, putting a watch on m_sincos[t] displays the correct values

-What really confuses me is that this makes everything work (but is too slow):

__m128 _sincos = m_sincos[t];
movaps xmm0, _sincos
movaps result, xmm0


Thanks for any help that you might be able to provide.

+8  A: 

m_sincos[t] is a C expression. Inside an assembly instruction (presumably an __asm block), however, it is interpreted as an x86 addressing mode, with a completely different result. For example, VS2008 SP1 compiles:

movaps xmm0, m_sincos[t]

into the following (see the disassembly window when the app crashes in debug mode):

movaps xmm0, xmmword ptr [t]

That interpretation attempts to copy a 128-bit value from the address of the variable t into xmm0. t, however, is a 32-bit value at an address that is probably not 16-byte aligned. Executing the instruction is likely to raise an alignment fault, and in the odd case where t's address happens to be aligned, it would silently load the wrong data.
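As a way to test the alignment claim above from C++, here is a minimal sketch that is not part of the original answer. The helper names is_aligned and assert_sse_aligned are hypothetical; the check simply tests whether an address is a multiple of 16, which is the condition movaps requires of its memory operand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when p is a multiple of `alignment` bytes.
// movaps needs its memory operand to satisfy this with alignment == 16.
inline bool is_aligned(const void* p, std::size_t alignment = 16) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical debug-time guard to call before an aligned load.
inline void assert_sse_aligned(const void* p) {
    assert(is_aligned(p, 16));
}
```

Calling is_aligned(&t) on an ordinary int and is_aligned(&m_sincos[t]) on the table should show which of the two addresses the faulting instruction was really using.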

You could fix this by using an appropriate x86 addressing mode. Here's the slow but clear version:

__asm mov eax, m_sincos                  ; eax <- m_sincos
__asm mov ebx, dword ptr t
__asm shl ebx, 4                         ; ebx <- t * 16 ; each array element is 16-bytes (128 bit) long
__asm movaps xmm0, xmmword ptr [eax+ebx] ; xmm0 <- m_sincos[t]
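The address arithmetic in those four instructions can be sketched in plain C++. This is an illustration only, not code from the answer: Vec4 is a hypothetical stand-in for __m128 (so the sketch builds without SSE headers), and element_address is a made-up helper that computes base + (t << 4) and checks it against the address the C expression &base[t] produces.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for __m128: four floats, 16 bytes, 16-byte aligned.
struct alignas(16) Vec4 { float lane[4]; };

// Hypothetical helper: address of element t computed the way the
// assembly does it, as base + (t << 4).
inline const Vec4* element_address(const Vec4* base, std::size_t t) {
    static_assert(sizeof(Vec4) == 16, "shift by 4 assumes 16-byte elements");
    const char* bytes = reinterpret_cast<const char*>(base);
    const Vec4* p = reinterpret_cast<const Vec4*>(bytes + (t << 4));
    assert(p == base + t);  // same address that &base[t] yields in C
    return p;
}
```

The C compiler performs this scaling implicitly whenever it sees m_sincos[t]; the inline assembler does not, which is why the shift has to be written out by hand.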

Sidenote:

When I put this in a complete program, something odd occurs:

#include <malloc.h>     // _aligned_malloc
#include <math.h>
#include <tchar.h>
#include <xmmintrin.h>

int main()
{
    static __m128 *m_sincos;
    int Bins = 4;

    m_sincos = (__m128*) _aligned_malloc(Bins*sizeof(__m128), 16);
    for (int t=0; t<Bins; t++) {
        m_sincos[t] = _mm_set_ps(cos((float) t), sin((float) t), sin((float) t), cos((float) t));
        __asm movaps xmm0, m_sincos[t];
        __asm mov eax, m_sincos
        __asm mov ebx, t
        __asm shl ebx, 4
        __asm movaps xmm0, [eax+ebx];
    }

    return 0;
}

When you run this and keep an eye on the registers window, you might notice that, although the results are correct, xmm0 already holds the correct value before the movaps instruction executes. How does that happen?

A look at the generated assembly code shows that _mm_set_ps() loads the sin/cos results into xmm0, then saves it to the memory address of m_sincos[t]. But the value remains there in xmm0 too. _mm_set_ps is an 'intrinsic', not a function call; it does not attempt to restore the values of registers it uses after it's done.

If there's a lesson to take from this, it might be that when using the SSE intrinsic functions, use them throughout, so the compiler can optimize things for you. Otherwise, if you're using inline assembly, use that throughout too.
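For illustration, here is a rough sketch of what "intrinsics throughout" might look like for the table in the question. It is not from the original answer: fill_sincos and scaled_entry are hypothetical names, the scaling operation is an arbitrary example of later use, and allocating the table (for instance with _aligned_malloc, as in the question) is left to the caller.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_set_ps, _mm_mul_ps, ...

// Fill table[0..bins) with the same values as the loop in the question.
inline void fill_sincos(__m128* table, int bins) {
    for (int t = 0; t < bins; ++t) {
        const float c = std::cos(static_cast<float>(t));
        const float s = std::sin(static_cast<float>(t));
        table[t] = _mm_set_ps(c, s, s, c);
    }
}

// Hypothetical use: multiply entry t by k and write the four floats to out.
// out must be 16-byte aligned because _mm_store_ps is an aligned store.
inline void scaled_entry(const __m128* table, int t, float k, float* out) {
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);
    const __m128 v = table[t];  // plain C indexing; the compiler emits the load
    _mm_store_ps(out, _mm_mul_ps(v, _mm_set1_ps(k)));
}
```

Because table[t] is an ordinary C expression here, the compiler handles the index scaling and register allocation itself, so there is no hand-written addressing mode to get wrong.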

Oren Trutner
@Oren Trutner - Wow, that's probably the best answer that I've read in all of my searching, thanks for the clear explanation! So, if I wanted to use assembly throughout, does that mean that I would have to do the shl instruction to move to the correct position in my array just as you do with the intrinsics? Thanks very much!!
Brett
Yes, you need to multiply the array index by 16 to get the correct offset. x86 has addressing modes that scale the index for you, avoiding the need to shift explicitly, but the only scale factors available are 1, 2, 4, and 8, so none of them multiplies by 16. An alternative would be to keep a byte offset and increment it by 16 on each iteration.
Oren Trutner
learned something new today. thank you
aaa
+1  A: 

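You should always use the intrinsics, or even just turn on the compiler's SSE support and leave the code generation to it, rather than hand-coding the assembly. One reason is that __asm is not portable to 64-bit code: MSVC does not support inline assembly when targeting x64.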

DeadMG
Thanks for the suggestion, I was just reading into that when you posted!
Brett