tags:

views:

45

answers:

1

Hello,

I’ve been looking into the Accelerate framework that was made available in iOS 4. Specifically, I made some attempts to use the Cblas routines in my linear algebra library in C. Now I can’t get the use of these functions to give me any performance gain over very basic routines. Specifically, the case of 4x4 matrix multiplication. Wherever I couldn’t make use of affine or homogeneous properties of the matrices, I’ve been using this routine (abridged):

float *mat4SetMat4Mult(const float *m0, const float *m1, float *target) {
    target[0] = m0[0] * m1[0] + m0[4] * m1[1] + m0[8] * m1[2] + m0[12] * m1[3];
    target[1] = ...etc...
    ...
    target[15] = m0[3] * m1[12] + m0[7] * m1[13] + m0[11] * m1[14] + m0[15] * m1[15];
    return target;
}

The equivalent function call for Cblas is:

cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
   4, 4, 4, 1.f, m0, 4, m1, 4, 0.f, target, 4);

Comparing the two, by making them run through a large number of pre-computed matrices filled with random numbers (each function gets the exact same input every time), the Cblas routine performs about 4x slower, when timed with the C clock() function.

This doesn’t seem right to me, and I’m left with the feeling that I’m doing something wrong somewhere. Do I have to to enable the device’s NEON unit and SIMD functionality somehow? Or shouldn’t I hope for better performance with such small matrices?

Very much appreciated,

Bastiaan