The usual way to do this sort of vectorization is to turn the problem "on its side". Instead of computing a single value of ox and oy, you compute four ox values and four oy values simultaneously. This minimizes wasted computation and shuffles.
In order to do this, you bundle up several x, y, p2x and p2y values into contiguous arrays (i.e. you might have an array of four values of x, an array of four values of y, etc.). Then you can just do:
movups %xmm0, [x]
movups %xmm1, [y]
movaps %xmm2, %xmm0
mulps %xmm0, [c] // cx
movaps %xmm3, %xmm1
mulps %xmm1, [s] // sy
mulps %xmm2, [s] // sx
mulps %xmm3, [c] // cy
subps %xmm0, %xmm1 // cx - sy
subps %xmm2, %xmm3 // sx - cy
mulps %xmm0, [scale] // (cx - sy)*m
mulps %xmm2, [scale] // (sx - cy)*m
movaps %xmm1, [p2x]
movaps %xmm3, [p2y]
subps %xmm1, %xmm0 // p2x - (cx - sy)*m
subps %xmm3, %xmm2 // p2y - (sx - cy)*m
movups [ox], %xmm1
movups [oy], %xmm3
Using this approach, we compute 4 results simultaneously in 18 instructions, vs. a single result in 13 instructions with your approach. We're also not wasting any results.
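
If you'd rather not write assembly, the same computation can be expressed with SSE intrinsics. This is just a sketch; the function and array names are mine, and it assumes each array holds (at least) four floats:

#include <xmmintrin.h>

/* Compute four results at once:
   ox[i] = p2x[i] - (c*x[i] - s*y[i])*m
   oy[i] = p2y[i] - (s*x[i] - c*y[i])*m  */
void rotate4(const float *x, const float *y,
             const float *p2x, const float *p2y,
             float c, float s, float m,
             float *ox, float *oy)
{
    __m128 vc = _mm_set1_ps(c);   /* broadcast c to all four lanes */
    __m128 vs = _mm_set1_ps(s);
    __m128 vm = _mm_set1_ps(m);

    __m128 vx = _mm_loadu_ps(x);  /* unaligned loads, like movups */
    __m128 vy = _mm_loadu_ps(y);

    __m128 t0 = _mm_sub_ps(_mm_mul_ps(vc, vx), _mm_mul_ps(vs, vy)); /* cx - sy */
    __m128 t1 = _mm_sub_ps(_mm_mul_ps(vs, vx), _mm_mul_ps(vc, vy)); /* sx - cy */

    _mm_storeu_ps(ox, _mm_sub_ps(_mm_loadu_ps(p2x), _mm_mul_ps(t0, vm)));
    _mm_storeu_ps(oy, _mm_sub_ps(_mm_loadu_ps(p2y), _mm_mul_ps(t1, vm)));
}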
It could still be improved on; since you would have to rearrange data structures anyway to use this approach, you should align the arrays and use aligned loads and stores instead of unaligned ones. You should load c and s into registers once and use them to process many vectors of x and y, instead of reloading them for each vector. For the best performance, two or more vectors' worth of computation should be interleaved, to make sure the processor has enough work to do and to prevent pipeline stalls.
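
Put together, those three improvements look something like the following sketch (a hypothetical rotate_many; it assumes 16-byte-aligned arrays and n a multiple of 8):

#include <stddef.h>
#include <xmmintrin.h>

void rotate_many(const float *x, const float *y,
                 const float *p2x, const float *p2y,
                 float c, float s, float m,
                 float *ox, float *oy, size_t n)
{
    /* Load the constants once, outside the loop. */
    const __m128 vc = _mm_set1_ps(c);
    const __m128 vs = _mm_set1_ps(s);
    const __m128 vm = _mm_set1_ps(m);

    /* Two independent vectors per iteration keep the pipeline busy. */
    for (size_t i = 0; i < n; i += 8) {
        __m128 x0 = _mm_load_ps(x + i),     y0 = _mm_load_ps(y + i);
        __m128 x1 = _mm_load_ps(x + i + 4), y1 = _mm_load_ps(y + i + 4);

        __m128 a0 = _mm_sub_ps(_mm_mul_ps(vc, x0), _mm_mul_ps(vs, y0)); /* cx - sy */
        __m128 a1 = _mm_sub_ps(_mm_mul_ps(vc, x1), _mm_mul_ps(vs, y1));
        __m128 b0 = _mm_sub_ps(_mm_mul_ps(vs, x0), _mm_mul_ps(vc, y0)); /* sx - cy */
        __m128 b1 = _mm_sub_ps(_mm_mul_ps(vs, x1), _mm_mul_ps(vc, y1));

        _mm_store_ps(ox + i,     _mm_sub_ps(_mm_load_ps(p2x + i),     _mm_mul_ps(a0, vm)));
        _mm_store_ps(ox + i + 4, _mm_sub_ps(_mm_load_ps(p2x + i + 4), _mm_mul_ps(a1, vm)));
        _mm_store_ps(oy + i,     _mm_sub_ps(_mm_load_ps(p2y + i),     _mm_mul_ps(b0, vm)));
        _mm_store_ps(oy + i + 4, _mm_sub_ps(_mm_load_ps(p2y + i + 4), _mm_mul_ps(b1, vm)));
    }
}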
(On a side note: should it be sx + cy instead of sx - cy? That would give you a standard rotation matrix.)
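
For reference, a standard rotation by an angle t, with c = cos(t) and s = sin(t), is:

[x']   [ c  -s ][x]
[y'] = [ s   c ][y]

i.e. x' = cx - sy and y' = sx + cy.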
Edit
Your comment on what hardware you're doing your timings on pretty much clears everything up: "Pentium 4 HT, 2.79GHz". That's a very old microarchitecture, on which unaligned moves and shuffles are quite slow; you don't have enough work in the pipeline to hide the latency of the arithmetic operations, and the reorder engine isn't nearly as clever as it is on newer microarchitectures.
I expect that your vector code would prove to be faster than the scalar code on an i7, and probably on a Core 2 as well. On the other hand, doing four at a time, if you could, would be much faster still.