Hi. I'm trying to write a Matrix3x3 multiply using the Vector Floating Point on the iPhone, however i'm encountering some problems. This is my first attempt at writing any ARM assembly, so it could be a faily simple solution that i'm not seeing.
I've currently got a small application running using a maths library that i've written. I'm investigating into the benifits using the Vector Floating Point Unit would provide so i've taken my matrix multiply and converted it to asm. Previously the application would run without a problem, however now my objects will all randomly disappear. This seems to be caused by the results from my matrix multiply becoming NAN at some point.
Heres the code
IMatrix3x3 operator*(IMatrix3x3 & _A, IMatrix3x3 & _B)
{
IMatrix3x3 C;
//C++ code for the simulator
#if TARGET_IPHONE_SIMULATOR == true
C.A0 = _A.A0 * _B.A0 + _A.A1 * _B.B0 + _A.A2 * _B.C0;
C.A1 = _A.A0 * _B.A1 + _A.A1 * _B.B1 + _A.A2 * _B.C1;
C.A2 = _A.A0 * _B.A2 + _A.A1 * _B.B2 + _A.A2 * _B.C2;
C.B0 = _A.B0 * _B.A0 + _A.B1 * _B.B0 + _A.B2 * _B.C0;
C.B1 = _A.B0 * _B.A1 + _A.B1 * _B.B1 + _A.B2 * _B.C1;
C.B2 = _A.B0 * _B.A2 + _A.B1 * _B.B2 + _A.B2 * _B.C2;
C.C0 = _A.C0 * _B.A0 + _A.C1 * _B.B0 + _A.C2 * _B.C0;
C.C1 = _A.C0 * _B.A1 + _A.C1 * _B.B1 + _A.C2 * _B.C1;
C.C2 = _A.C0 * _B.A2 + _A.C1 * _B.B2 + _A.C2 * _B.C2;
//VPU ARM asm for the device
#else
//create a pointer to the Matrices
IMatrix3x3 * pA = &_A;
IMatrix3x3 * pB = &_B;
IMatrix3x3 * pC = &C;
//asm code
asm volatile(
//turn on a vector depth of 3
"fmrx r0, fpscr \n\t"
"bic r0, r0, #0x00370000 \n\t"
"orr r0, r0, #0x00020000 \n\t"
"fmxr fpscr, r0 \n\t"
//load matrix B into the vector bank
"fldmias %1, {s8-s16} \n\t"
//load the first row of A into the scalar bank
"fldmias %0!, {s0-s2} \n\t"
//calulate C.A0, C.A1 and C.A2
"fmuls s17, s8, s0 \n\t"
"fmacs s17, s11, s1 \n\t"
"fmacs s17, s14, s2 \n\t"
//save this into the output
"fstmias %2!, {s17-s19} \n\t"
//load the second row of A into the scalar bank
"fldmias %0!, {s0-s2} \n\t"
//calulate C.B0, C.B1 and C.B2
"fmuls s17, s8, s0 \n\t"
"fmacs s17, s11, s1 \n\t"
"fmacs s17, s14, s2 \n\t"
//save this into the output
"fstmias %2!, {s17-s19} \n\t"
//load the third row of A into the scalar bank
"fldmias %0!, {s0-s2} \n\t"
//calulate C.C0, C.C1 and C.C2
"fmuls s17, s8, s0 \n\t"
"fmacs s17, s11, s1 \n\t"
"fmacs s17, s14, s2 \n\t"
//save this into the output
"fstmias %2!, {s17-s19} \n\t"
//set the vector depth back to 1
"fmrx r0, fpscr \n\t"
"bic r0, r0, #0x00370000 \n\t"
"orr r0, r0, #0x00000000 \n\t"
"fmxr fpscr, r0 \n\t"
//pass the inputs and set the clobber list
: "+r"(pA), "+r"(pB), "+r" (pC) :
:"cc", "memory","s0", "s1", "s2", "s8", "s9", "s10", "s11", "s12", "s13", "s14", "s15", "s16", "s17", "s18", "s19"
);
#endif
return C;
}
As far as i can see that makes sence. While debugging i've managed to notice that if i were to say _A = C
prior to the return and after the ASM, _A
will not necessarily be equal to C
which has only increased my confusion. I had thought it was possibly due to the pointers I'm giving to the VFPU being incrimented by lines such as "fldmias %0!, {s0-s2} \n\t"
however my understanding of asm is not good enough to properly understand the problem, nor to see an alternative approach to that line of code.
Anyway, I was hoping someone with a greater understanding than me would be able to see a solution, and any help would be greatly appreciated, thank you :-)
Edit: I've found that pC
seems to be NULL when the asm code is hit despite being set pC = &C
. I'm assuming this is due to the compiler rearranging the code in a manor thats breaking it? I've tried the various methods I've seen for stopping this happening (like adding everything relevent in the input list - thought this shouldnt even be nessisary since i'm listing "memory" in the clobber list) and I'm still getting the same problems.
Edit #2: Right, the memory issue seems to have been caused by me not including "r0"
in the clobber list, however fixing that (if it is indeed fixed) doesnt seem to have fixed the problem. I noticed that multiplying a rotation matrix by the identity matrix doesn't work correctly and instead gives 0.88 as the last entry in the matrix instead of 1:
| 0.88 0.48 0 | | 1 0 0 | | 0.88 0.48 0 |
|-0.48 0.88 0 | * | 0 1 0 | = |-0.48 0.88 0 |
| 0 0 1 | | 0 0 1 | | 0 0 0.88|
I figured then that my logic must be wrong somewhere so i stepped through the assembly. everything seems fine up until the last "fmacs s17, s14, s2 \n\t" where:
s0 = 0 s14 = 0 s17 = 0
s1 = 0 s15 = 0 s18 = 0
s2 = 1 s16 = 1 s19 = 0
so surely the fmacs
is performing the operation:
s17 = s17 + s14 * s2 = 0 + 0 * 1 = 0
s18 = s18 + s15 * s2 = 0 + 0 * 1 = 0
s19 = s19 + s16 * s2 = 0 + 1 * 1 = 1
However the result gives s19 = 0.88
which has left me even more confused :S am i misunderstanding how fmacs
works? (P.S sorry for what has now become a really long question :-P)