views:

106

answers:

3

I've been trying to figure out how to gain some improvement in my code in a very crucial couple of lines:

float x = a*b;
float y = c*d;
float z = e*f;
float w = g*h;

all a, b, c... are floats.

I decided to look into using SSE, but can't seem to find any improvement, in fact it turns out to be twice as slow. My SSE code is:

float abcd[4] = { a, b, c, d };
float efgh[4] = { e, f, g, h };
float result[4];
_asm {
    movups xmm1, abcd
    movups xmm2, efgh
    mulps xmm1, xmm2
    movups result, xmm1
}

I also attempted using standard (non-SSE) inline assembly, but it doesn't appear that I can pack a register with four floating-point values the way I can with SSE.

Any comments or help would be greatly appreciated. Mainly, I need to understand why my calculations using SSE are slower than the serial C++ code.

I'm compiling in Visual Studio 2005, on Windows XP, using a Pentium 4 with HT, if that provides any additional information to assist.

Thanks in advance!

+1  A: 

You can enable the use of SSE and SSE2 in the project options in newer VS versions, and possibly in 2005. Are you compiling with an Express edition?

Also, your SSE code is probably slower because when you compile serial C++, the compiler is smart and does a very good job of making it fast: for example, automatically putting values in the right registers at the right time. If the operations occur in series, the compiler can reduce the impact of caching and paging. Inline assembler, however, is poorly optimized at best and should be avoided whenever possible.

In addition, you'd have to be performing a HUGE amount of work for SSE/2 to bring a notable benefit.

DeadMG
I guess what still confuses me is the fact that I've gotten some working SSE/2 code (I've had many versions of the code pasted above), and it's actually gone slower than my serial code. Enough so that my ~10 second program (written completely serially) then takes ~11.5 seconds (with just that operation in SSE/2).
Brett
Compiler, learn to love it. :P
DeadMG
+2  A: 

You are using unaligned instructions, which are very slow. You may want to try aligning your data correctly, on a 16-byte boundary, and using movaps. A better alternative is to use intrinsics rather than assembly, because then the compiler is free to order instructions as it sees necessary.
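For illustration, a minimal intrinsics sketch of this advice (the function name `mul4_aligned` is invented for the example, and `alignas(16)` is the modern C++ spelling; VS2005 would use `__declspec(align(16))` on the caller's arrays):

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_load_ps, _mm_mul_ps, _mm_store_ps

// Multiply four pairs of floats in one SSE operation. All three pointers must
// refer to 16-byte-aligned storage so the aligned load/store forms (movaps)
// are legal; an unaligned pointer here would fault.
void mul4_aligned(const float* abcd, const float* efgh, float* result)
{
    __m128 v1 = _mm_load_ps(abcd);              // compiles to movaps
    __m128 v2 = _mm_load_ps(efgh);
    _mm_store_ps(result, _mm_mul_ps(v1, v2));   // mulps, then aligned store
}
```

Because the compiler schedules the generated instructions itself, it can interleave them with surrounding code, which hand-written `_asm` blocks prevent.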

aaa
So, I tested what I think you're saying: I used a movups command to store the values aligned in the register, then used movaps to simulate having aligned data, and it is finally faster than the serial C++ code, so long as I start my timer after aligning the data. If I always start with unaligned data, would it make sense that I don't see a benefit from SSE/ASM?
Brett
+2  A: 

As you've found out, just replacing a couple of instructions with SSE is not going to work, because you need to shuffle the data around in memory in order to load the SSE registers correctly. This moving of data around in memory (the bit that constructs the arrays) is going to kill your performance, since memory is very slow (hard disk aside, memory is invariably the bottleneck these days).

Also, there is no way to move data between the SSE registers and the FPU/ALU without using a write to RAM followed by a read. Modern IA32 chips cope well with this particular pattern (write then read) but will still invalidate some of the cache, which will have a knock-on effect.

To get the best out of SSE you need to look at the whole algorithm and the data the algorithm uses. The values of a, b, c and d and e, f, g and h need to be permanently in those arrays so that there is no shifting of data around in memory prior to loading the SSE registers. This is not straightforward and may require a lot of reworking of your code and data (you may need to store the data differently on disk).
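As a sketch of what "looking at the whole algorithm" can buy (the function name `mul_arrays` is invented for the example): once the inputs already live in contiguous arrays, the loads amortize across many multiplies instead of paying a data-shuffling cost per group of four.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Element-wise multiply of two float arrays, four lanes at a time. The data
// is already laid out contiguously, so nothing is rearranged per iteration --
// that is where the SSE speedup actually comes from.
void mul_arrays(const float* a, const float* b, float* out, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   // movups; use _mm_load_ps if 16-byte aligned
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(va, vb));
    }
    for (; i < n; ++i)                     // scalar tail when n is not a multiple of 4
        out[i] = a[i] * b[i];
}
```

With aligned allocations, swapping the `loadu`/`storeu` forms for the aligned ones removes the unaligned-access penalty the earlier answer mentions.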

It might also be worth pointing out that SSE is only 32-bit (or 64-bit if you use doubles), whereas the FPU is 80-bit (regardless of float or double), so you will get slightly different results when using SSE compared to using the FPU. Only you know if this will be an issue.

Skizz
From what I understand of your answer, it sounds like I should only try to utilize the intrinsics if I can use them for more than one calculation, is this correct? And the reasoning behind that is that I'm not very efficient at moving the data around on my own? I'm unable to store the values of a, b, c, and d and e, f, g, and h in those arrays permanently, as they need to load current values for each calculation, so I would be hard-pressed to see a benefit? Thanks for any help!
Brett
@Brett: Yes, that's basically it. You need to keep everything in SSE to really get the benefit. There's a bit of a clue in the name SSE - Streaming SIMD Extensions. Just out of curiosity, where do those values come from, i.e. what's the bigger picture?
Skizz
@Skizz: So, the bigger picture is that it is actually part of a rotation matrix, but I'm doing one rotation matrix per iteration through a big loop where I compare feature vectors. Because of the structure, I see no way to chain a bunch of SSE calculations together, but even the slightest benefit will definitely result in a great improvement to the runtime of my program. Alternatively, the b, d, f, h are sin and cos values that are calculated in the initialization phase, and I'm currently moving to storing those in aligned blocks for faster multiplication. Thx for your help!
Brett
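A hedged sketch of the layout Brett describes in the last comment (the names `SinCosBlock`, `init_block`, and `rotate_mul` are invented for the example, and `alignas(16)` stands in for VS2005's `__declspec(align(16))`): the sin/cos values are computed once during initialization and stored in 16-byte-aligned blocks, so each loop iteration needs only an aligned load and a single mulps.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cmath>

// One precomputed block per angle, aligned so movaps loads are legal.
struct alignas(16) SinCosBlock {
    float v[4];  // e.g. { cos(t), sin(t), cos(t), sin(t) }
};

// Initialization phase: fill a block once per angle.
void init_block(SinCosBlock& blk, float theta)
{
    blk.v[0] = std::cos(theta);
    blk.v[1] = std::sin(theta);
    blk.v[2] = blk.v[0];
    blk.v[3] = blk.v[1];
}

// Inner-loop body: four products (a*b, c*d, e*f, g*h) in one SSE multiply.
// Both float pointers must refer to 16-byte-aligned storage.
void rotate_mul(const SinCosBlock& blk, const float* aceg, float* out)
{
    __m128 sc = _mm_load_ps(blk.v);   // aligned: the block was stored aligned at init
    __m128 x  = _mm_load_ps(aceg);
    _mm_store_ps(out, _mm_mul_ps(x, sc));
}
```

This keeps all the data motion in the one-time initialization phase, matching the earlier advice that the per-iteration path should do no shuffling before the loads.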