+5  A: 

The usual way to do this sort of vectorization is to turn the problem "on its side". Instead of computing a single value of ox and oy, you compute four ox values and four oy values simultaneously. This minimizes wasted computation and shuffles.

In order to do this, you bundle up several x, y, p2x and p2y values into contiguous arrays (i.e. you might have an array of four values of x, an array of four values of y, etc). Then you can just do:

movups  %xmm0,  [x]
movups  %xmm1,  [y]
movaps  %xmm2,  %xmm0
mulps   %xmm0,  [c]    // cx
movaps  %xmm3,  %xmm1
mulps   %xmm1,  [s]    // sy
mulps   %xmm2,  [s]    // sx
mulps   %xmm3,  [c]    // cy
subps   %xmm0,  %xmm1  // cx - sy
subps   %xmm2,  %xmm3  // sx - cy
mulps   %xmm0,  scale  // (cx - sy)*m
mulps   %xmm2,  scale  // (sx - cy)*m
movaps  %xmm1,  [p2x]
movaps  %xmm3,  [p2y]
subps   %xmm1,  %xmm0  // p2x - (cx - sy)*m
subps   %xmm3,  %xmm2  // p2y - (sx - cy)*m
movups  [ox],   %xmm1
movups  [oy],   %xmm3

Using this approach, we compute 4 results simultaneously in 18 instructions, vs. a single result in 13 instructions with your approach. We're also not wasting any results.
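For readers who prefer intrinsics to hand-written assembly, here is a minimal sketch of the same four-at-a-time computation in C. The `Batch4` struct and the `transform4` name are hypothetical, introduced only for illustration, and the signs deliberately mirror the listing above (cx - sy and sx - cy) rather than the standard rotation.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical structure-of-arrays batch: four points per field. */
typedef struct {
    float x[4], y[4];      /* inputs  */
    float p2x[4], p2y[4];  /* offsets */
    float ox[4], oy[4];    /* outputs */
} Batch4;

/* For each of the four lanes, computes
 *   ox = p2x - (c*x - s*y)*m
 *   oy = p2y - (s*x - c*y)*m
 * with the same signs as the assembly listing. */
static void transform4(Batch4 *b, float c, float s, float m)
{
    assert(b != NULL);

    /* Broadcast the scalars to all four lanes. */
    const __m128 vc = _mm_set1_ps(c);
    const __m128 vs = _mm_set1_ps(s);
    const __m128 vm = _mm_set1_ps(m);

    /* Unaligned loads, as in the listing. */
    const __m128 x = _mm_loadu_ps(b->x);
    const __m128 y = _mm_loadu_ps(b->y);

    /* cx - sy and sx - cy, four lanes at once. */
    const __m128 rx = _mm_sub_ps(_mm_mul_ps(vc, x), _mm_mul_ps(vs, y));
    const __m128 ry = _mm_sub_ps(_mm_mul_ps(vs, x), _mm_mul_ps(vc, y));

    /* p2 - r*m, then store all four results. */
    _mm_storeu_ps(b->ox, _mm_sub_ps(_mm_loadu_ps(b->p2x), _mm_mul_ps(rx, vm)));
    _mm_storeu_ps(b->oy, _mm_sub_ps(_mm_loadu_ps(b->p2y), _mm_mul_ps(ry, vm)));
}
```

You fill the four `x`, `y`, `p2x` and `p2y` slots, call `transform4` once, and read back four `ox` and four `oy` values. No shuffles are needed because every lane does the same arithmetic on its own point.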

It could still be improved on. Since you would have to rearrange data structures anyway to use this approach, you should align the arrays and use aligned loads and stores instead of unaligned ones. You should load c and s into registers and use them to process many vectors of x and y, instead of reloading them for each vector. For the best performance, two or more vectors' worth of computation should be interleaved to make sure the processor has enough work to do and to prevent pipeline stalls.
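A rough sketch of those three refinements together, again in C with SSE intrinsics. The function name `transform_all` and its parameter layout are assumptions for illustration; it expects every array to be 16-byte aligned and `n` to be a multiple of 8, and it keeps the same signs as the listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: true if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Processes n points stored as separate, 16-byte-aligned arrays.
 * c, s and m are broadcast once, outside the loop, and each
 * iteration handles two independent vectors (8 points) so the
 * processor has more work in flight. */
static void transform_all(const float *x, const float *y,
                          const float *p2x, const float *p2y,
                          float *ox, float *oy, size_t n,
                          float c, float s, float m)
{
    assert(n % 8 == 0);
    assert(is_aligned16(x) && is_aligned16(y));
    assert(is_aligned16(p2x) && is_aligned16(p2y));
    assert(is_aligned16(ox) && is_aligned16(oy));

    const __m128 vc = _mm_set1_ps(c);
    const __m128 vs = _mm_set1_ps(s);
    const __m128 vm = _mm_set1_ps(m);

    for (size_t i = 0; i < n; i += 8) {
        /* Aligned loads for two vectors, "a" and "b". */
        const __m128 xa = _mm_load_ps(x + i);
        const __m128 xb = _mm_load_ps(x + i + 4);
        const __m128 ya = _mm_load_ps(y + i);
        const __m128 yb = _mm_load_ps(y + i + 4);

        /* cx - sy and sx - cy for both vectors; the a and b
         * chains do not depend on each other. */
        const __m128 rxa = _mm_sub_ps(_mm_mul_ps(vc, xa), _mm_mul_ps(vs, ya));
        const __m128 rxb = _mm_sub_ps(_mm_mul_ps(vc, xb), _mm_mul_ps(vs, yb));
        const __m128 rya = _mm_sub_ps(_mm_mul_ps(vs, xa), _mm_mul_ps(vc, ya));
        const __m128 ryb = _mm_sub_ps(_mm_mul_ps(vs, xb), _mm_mul_ps(vc, yb));

        /* p2 - r*m, written back with aligned stores. */
        _mm_store_ps(ox + i,     _mm_sub_ps(_mm_load_ps(p2x + i),     _mm_mul_ps(rxa, vm)));
        _mm_store_ps(ox + i + 4, _mm_sub_ps(_mm_load_ps(p2x + i + 4), _mm_mul_ps(rxb, vm)));
        _mm_store_ps(oy + i,     _mm_sub_ps(_mm_load_ps(p2y + i),     _mm_mul_ps(rya, vm)));
        _mm_store_ps(oy + i + 4, _mm_sub_ps(_mm_load_ps(p2y + i + 4), _mm_mul_ps(ryb, vm)));
    }
}
```

Whether the compiler keeps the two chains interleaved in the generated code is up to its scheduler, so it is worth checking the disassembly and timing it on your own hardware.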

(On a side note: should it be cx + sy instead of cx - sy? That would give you a standard rotation matrix.)

Edit

Your comment on what hardware you're doing your timings on pretty much clears everything up: "Pentium 4 HT, 2.79GHz". That's a very old microarchitecture, on which unaligned moves and shuffles are quite slow; you don't have enough work in the pipeline to hide the latency of the arithmetic operations, and the reorder engine isn't nearly as clever as it is on newer microarchitectures.

I expect that your vector code would prove to be faster than the scalar code on i7, and probably on Core2 as well. On the other hand, doing four at a time, if you could, would be much faster still.

Stephen Canon
I agree with what you're saying, but I can't calculate multiple ox's and oy's at a time. I have to do one calculation, then wait for my program to do some more work until I come back to this calculation. Therefore, I'm not sure there is any way that I can utilize all of the memory like we would both hope to see (calculating four floating points simultaneously). Can you think of anything that I can do to help optimize the way this is being calculated for one single calculation rather than a set? Thanks!! Brett
Brett
Also, yeah, cx+sy is correct. Thank you for noticing that.
Brett