Hi. I'm trying to write a Matrix3x3 multiply using the Vector Floating Point (VFP) unit on the iPhone, but I'm running into some problems. This is my first attempt at writing any ARM assembly, so there could be a fairly simple solution that I'm not seeing.

I've currently got a small application running using a maths library that I've written. I'm investigating the benefits that the Vector Floating Point unit would provide, so I've taken my matrix multiply and converted it to asm. Previously the application ran without a problem, but now my objects all randomly disappear. This seems to be caused by the results of my matrix multiply becoming NaN at some point.

Here's the code:

IMatrix3x3 operator*(IMatrix3x3 & _A, IMatrix3x3 & _B)
{
    IMatrix3x3 C;

    //C++ code for the simulator
#if TARGET_IPHONE_SIMULATOR == true
    C.A0 = _A.A0 * _B.A0 + _A.A1 * _B.B0 + _A.A2 * _B.C0;
    C.A1 = _A.A0 * _B.A1 + _A.A1 * _B.B1 + _A.A2 * _B.C1;
    C.A2 = _A.A0 * _B.A2 + _A.A1 * _B.B2 + _A.A2 * _B.C2;

    C.B0 = _A.B0 * _B.A0 + _A.B1 * _B.B0 + _A.B2 * _B.C0;
    C.B1 = _A.B0 * _B.A1 + _A.B1 * _B.B1 + _A.B2 * _B.C1;
    C.B2 = _A.B0 * _B.A2 + _A.B1 * _B.B2 + _A.B2 * _B.C2;

    C.C0 = _A.C0 * _B.A0 + _A.C1 * _B.B0 + _A.C2 * _B.C0;
    C.C1 = _A.C0 * _B.A1 + _A.C1 * _B.B1 + _A.C2 * _B.C1;
    C.C2 = _A.C0 * _B.A2 + _A.C1 * _B.B2 + _A.C2 * _B.C2;

//VFP ARM asm for the device
#else   
    //create a pointer to the Matrices
    IMatrix3x3 * pA = &_A;
    IMatrix3x3 * pB = &_B;
    IMatrix3x3 * pC = &C;

//asm code
asm volatile(
             //turn on a vector depth of 3
             "fmrx r0, fpscr \n\t"
             "bic r0, r0, #0x00370000 \n\t"
             "orr r0, r0, #0x00020000 \n\t"
             "fmxr fpscr, r0 \n\t"

             //load matrix B into the vector bank
             "fldmias %1, {s8-s16} \n\t"

             //load the first row of A into the scalar bank
             "fldmias %0!, {s0-s2} \n\t"

             //calculate C.A0, C.A1 and C.A2
             "fmuls s17, s8, s0 \n\t"
             "fmacs s17, s11, s1 \n\t"
             "fmacs s17, s14, s2 \n\t"

             //save this into the output
             "fstmias %2!, {s17-s19} \n\t"

             //load the second row of A into the scalar bank
             "fldmias %0!, {s0-s2} \n\t"

             //calculate C.B0, C.B1 and C.B2
             "fmuls s17, s8, s0 \n\t"
             "fmacs s17, s11, s1 \n\t"
             "fmacs s17, s14, s2 \n\t"

             //save this into the output
             "fstmias %2!, {s17-s19} \n\t"

             //load the third row of A into the scalar bank
             "fldmias %0!, {s0-s2} \n\t"

             //calculate C.C0, C.C1 and C.C2
             "fmuls s17, s8, s0 \n\t"
             "fmacs s17, s11, s1 \n\t"
             "fmacs s17, s14, s2 \n\t"

             //save this into the output
             "fstmias %2!, {s17-s19} \n\t"

             //set the vector depth back to 1
             "fmrx r0, fpscr \n\t"
             "bic r0, r0, #0x00370000 \n\t"
             "orr r0, r0, #0x00000000 \n\t"
             "fmxr fpscr, r0 \n\t"

             //pass  the inputs and set the clobber list
             : "+r"(pA), "+r"(pB), "+r" (pC) :
             :"cc", "memory","s0", "s1", "s2", "s8", "s9", "s10", "s11", "s12", "s13", "s14", "s15", "s16", "s17", "s18", "s19"
             );
#endif
    return C;
}

As far as I can see, that makes sense. While debugging I noticed that if I assign _A = C after the asm and before the return, _A will not necessarily be equal to C, which has only increased my confusion. I had thought it was possibly due to the pointers I'm giving to the VFPU being incremented by lines such as "fldmias %0!, {s0-s2} \n\t", but my understanding of asm is not good enough to properly understand the problem, nor to see an alternative approach to that line of code.

Anyway, I was hoping someone with a greater understanding than me would be able to see a solution, and any help would be greatly appreciated, thank you :-)

Edit: I've found that pC seems to be NULL when the asm code is hit, despite being set with pC = &C. I'm assuming this is due to the compiler rearranging the code in a manner that's breaking it? I've tried the various methods I've seen for stopping this from happening (like adding everything relevant to the input list, though this shouldn't even be necessary since I'm listing "memory" in the clobber list) and I'm still getting the same problems.

Edit #2: Right, the memory issue seems to have been caused by me not including "r0" in the clobber list. However, fixing that (if it is indeed fixed) doesn't seem to have fixed the overall problem. I noticed that multiplying a rotation matrix by the identity matrix doesn't work correctly and instead gives 0.88 as the last entry in the matrix instead of 1:

| 0.88 0.48 0 |     | 1 0 0 |     | 0.88 0.48 0   |
|-0.48 0.88 0 |  *  | 0 1 0 |  =  |-0.48 0.88 0   |
| 0    0    1 |     | 0 0 1 |     | 0    0    0.88|

I figured then that my logic must be wrong somewhere, so I stepped through the assembly. Everything seems fine up until the last "fmacs s17, s14, s2 \n\t", where:

s0 = 0    s14 = 0    s17 = 0
s1 = 0    s15 = 0    s18 = 0
s2 = 1    s16 = 1    s19 = 0

so surely the fmacs is performing the operation:

s17 = s17 + s14 * s2 = 0 + 0 * 1 = 0
s18 = s18 + s15 * s2 = 0 + 0 * 1 = 0
s19 = s19 + s16 * s2 = 0 + 1 * 1 = 1

However, the result gives s19 = 0.88, which has left me even more confused :S Am I misunderstanding how fmacs works? (P.S. sorry for what has now become a really long question :-P)

A: 

Solved the problem! I was unaware that the vector banks were "circular".

The banks s0-s7, s8-s15, s16-s23 and s24-s31 can each hold vectors of up to 8 elements, and a vector is used simply by naming its first register together with the current vector length, for example s16 with a length of 4. In my case, however, I had been using s14 with a length of 3, assuming this would give me s14, s15 and s16. Because the banks are circular, the vector instead wraps around to s8; in other words, I was actually using s14, s15 and s8.
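To make the wrap-around concrete, here is a rough plain C++ model of it (not VFP code, and not something from the original post). The names vfpElement, fmacs and lastFmacsFromQuestion are made up for illustration, a stride of 1 is assumed, and the value in s8 is an assumption: the question never shows s8, but 0.88 there is what the observed result implies.

```cpp
#include <array>
#include <cassert>

// Register number of element i of a VFP short vector that starts at s<base>.
// Assumes a stride of 1. Elements never leave the 8-register bank that
// contains s<base>; they wrap around to the start of that bank instead.
int vfpElement(int base, int i) {
    assert(base >= 0 && base < 32 && i >= 0 && i < 8);
    int bankStart = base & ~7;                 // 0, 8, 16 or 24
    return bankStart + (base - bankStart + i) % 8;
}

// Models "fmacs s<d>, s<n>, s<m>" with vector length len, only for the case
// used in the question: d and n in vector banks, m in s0-s7 (used as a scalar).
void fmacs(std::array<float, 32>& s, int d, int n, int m, int len) {
    for (int i = 0; i < len; ++i) {
        s[vfpElement(d, i)] += s[vfpElement(n, i)] * s[m];
    }
}

// Replays the last fmacs from the question and returns the resulting s19.
float lastFmacsFromQuestion() {
    std::array<float, 32> s{};   // every register starts at 0
    s[2]  = 1.0f;                // s2 = 1, as seen in the debugger
    s[16] = 1.0f;                // s16 = 1, as seen in the debugger
    s[8]  = 0.88f;               // ASSUMPTION: s8 held the 0.88 matrix entry
    fmacs(s, 17, 14, 2, 3);      // "fmacs s17, s14, s2" with length 3
    return s[19];
}
```

In this model vfpElement(14, 2) is 8 rather than 16, so the third lane of the multiply reads s8, and lastFmacsFromQuestion() returns 0.88 instead of the expected 1. A vector that starts low enough in its bank, such as s16 with a length of 4, stays on s16 to s19 and is unaffected.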

Took me a long time to see that, so hopefully anyone else with a similar problem will find this :-)

AzCopey