Hello,
I'm trying to come up with a way to make the computer do some work for me. I'm using SIMD (SSE2 & SSE3) to calculate the cross product, and I was wondering if it could go any faster. Currently I have the following:
const int maskShuffleCross1 = _MM_SHUFFLE(3,0,2,1); // y z x
const int maskShuffleCross2 = _MM_SHUFFLE(3,1,0,2); // z x y
__m128 QuadCrossProduct(__m128* quadA, __m128* quadB)
{
    // (y * other.z) - (z * other.y)
    // (z * other.x) - (x * other.z)
    // (x * other.y) - (y * other.x)
    return
    (
        _mm_sub_ps
        (
            _mm_mul_ps
            (
                _mm_shuffle_ps(*quadA, *quadA, maskShuffleCross1),
                _mm_shuffle_ps(*quadB, *quadB, maskShuffleCross2)
            ),
            _mm_mul_ps
            (
                _mm_shuffle_ps(*quadA, *quadA, maskShuffleCross2),
                _mm_shuffle_ps(*quadB, *quadB, maskShuffleCross1)
            )
        )
    );
}
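One way to sanity-check the shuffle masks is to compare the SIMD result against a plain scalar cross product. The following is only a minimal sketch: `Vec3`, `ScalarCross` and `SimdCross` are hypothetical helper names introduced for illustration, and it assumes x is stored in lane 0, y in lane 1 and z in lane 2.

```cpp
#include <xmmintrin.h> // SSE: __m128, _mm_shuffle_ps, _mm_mul_ps, _mm_sub_ps

const int maskShuffleCross1 = _MM_SHUFFLE(3,0,2,1); // y z x
const int maskShuffleCross2 = _MM_SHUFFLE(3,1,0,2); // z x y

// Same computation as the function above, repeated so this compiles on its own.
__m128 QuadCrossProduct(__m128* quadA, __m128* quadB)
{
    return _mm_sub_ps(
        _mm_mul_ps(_mm_shuffle_ps(*quadA, *quadA, maskShuffleCross1),
                   _mm_shuffle_ps(*quadB, *quadB, maskShuffleCross2)),
        _mm_mul_ps(_mm_shuffle_ps(*quadA, *quadA, maskShuffleCross2),
                   _mm_shuffle_ps(*quadB, *quadB, maskShuffleCross1)));
}

// Hypothetical plain-float vector used only for this check.
struct Vec3 { float x, y, z; };

// Scalar reference, written straight from the three formulas in the comments.
Vec3 ScalarCross(Vec3 a, Vec3 b)
{
    Vec3 r;
    r.x = a.y * b.z - a.z * b.y;
    r.y = a.z * b.x - a.x * b.z;
    r.z = a.x * b.y - a.y * b.x;
    return r;
}

// Packs two Vec3s into registers, runs the SIMD version, unpacks lanes 0..2.
Vec3 SimdCross(Vec3 a, Vec3 b)
{
    // _mm_set_ps takes its arguments from lane 3 down to lane 0,
    // so x ends up in lane 0 (assumption stated above).
    __m128 qa = _mm_set_ps(0.0f, a.z, a.y, a.x);
    __m128 qb = _mm_set_ps(0.0f, b.z, b.y, b.x);
    __m128 qr = QuadCrossProduct(&qa, &qb);

    float out[4];
    _mm_storeu_ps(out, qr);

    Vec3 r;
    r.x = out[0];
    r.y = out[1];
    r.z = out[2];
    return r;
}
```

Calling `SimdCross` and `ScalarCross` with the same two inputs and comparing the components should give matching values if the masks are right.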
As you can see, there are four _mm_shuffle_ps's in there, and I wondered if I could replace them with a combination of _mm_unpackhi_ps and _mm_unpacklo_ps, which return a2 b2 a3 b3 and a0 b0 a1 b1 respectively and are, as far as I know, slightly faster.
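To double-check which lanes the two unpack intrinsics actually pick, a small test like the one below can help. It is only a sketch; `UnpackDemo` is a hypothetical name, and the input values are chosen so that each lane is easy to recognise in the output.

```cpp
#include <xmmintrin.h> // SSE: _mm_set_ps, _mm_unpacklo_ps, _mm_unpackhi_ps

// Hypothetical demo: writes the lanes of unpacklo(a, b) into lo[0..3]
// and the lanes of unpackhi(a, b) into hi[0..3], lane 0 first.
void UnpackDemo(float lo[4], float hi[4])
{
    // _mm_set_ps lists lane 3 first and lane 0 last.
    __m128 a = _mm_set_ps( 3.0f,  2.0f,  1.0f,  0.0f); // a0=0  a1=1  a2=2  a3=3
    __m128 b = _mm_set_ps(13.0f, 12.0f, 11.0f, 10.0f); // b0=10 b1=11 b2=12 b3=13

    // Interleaves the low halves of a and b:  a0 b0 a1 b1
    _mm_storeu_ps(lo, _mm_unpacklo_ps(a, b));

    // Interleaves the high halves of a and b: a2 b2 a3 b3
    _mm_storeu_ps(hi, _mm_unpackhi_ps(a, b));
}
```

With these inputs, `lo` comes out as 0, 10, 1, 11 and `hi` as 2, 12, 3, 13, so both intrinsics interleave elements from the two operands rather than copying pairs from each.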
I couldn't figure it out on paper, but I thought of a solution. What if I let the computer brute-force the steps required? Just recursively step through the different options and see which sequence gives the correct answer.
I got it to work with multiply. When I want it to return (3, 12, 27, 0), it prints this:
startA = _mm_set_ps(1.00, 2.00, 3.00, 0.00);
startB = _mm_set_ps(3.00, 3.00, 3.00, 0.00);
result0 = _mm_mul_ps(startA, startB);
// (3.00, 6.00, 9.00, 0.00)
result1 = _mm_mul_ps(startA, result0);
// (3.00, 12.00, 27.00, 0.00)
Very nice, if I say so myself.
However, when I wanted to implement divide I stumbled on a problem. The multiply function doesn't just have to call itself, it also has to call divide. Okay, so we put divide above multiply. But divide doesn't just have to call itself either, it also has to call multiply, which is defined lower in the file, so the compiler hasn't seen it yet at that point.
I started with an empty console application in Visual C++ and put everything in QuadTests.cpp.
How do I make sure these two functions can call each other?
Thanks in advance.