ansaurus

Question

Answer 1

+26 A:

No, the Strassen or Coppersmith-Winograd algorithm wouldn't make much difference here. They start to pay off for larger matrices only.

If your matrix-multiplication is really a bottleneck you could rewrite the algorithm using NEON SIMD instructions. That would only help for ARMv7 as ARMv6 does not has this extension though.

I'd expect a factor 3 speedup over the compiled C-code for your case.

EDIT: You can find a nice implementation in ARM-NEON here: http://code.google.com/p/math-neon/

For your C-code there are two things you could do to speed up the code:

Don't inline the function. Your matrix multiplication generates quite a bit of code as it's unrolled, and the ARM only has a very tiny instruction cache. Excessive inlining can make your code slower because the CPU will be busy loading code into the cache instead of executing it.
Use the restrict keyword to tell the compiler that the source- and destination pointers don't overlap in memory. Currently the compiler is forced to reload every source value from memory whenever a result is written because it has to assume that source and destination may overlap or even point to the same memory.

Nils Pipenbrinck 2009-11-04 14:21:20

Excellent advice for him to use NEON on ARMv7. One other note: when compiling this for ARMv6, be sure to turn thumb codegen off.

Stephen Canon 2009-11-04 16:13:31

that point #2 seems very important. +1

rmeador 2009-11-04 16:21:09

@rmeador, for the performance it makes about 25% difference (just checked).

Nils Pipenbrinck 2009-11-04 16:57:39

Nils, thanks so much for the excellent evaluation - you sure are a legend! (for those that dont know, google Farbrausch)

Till 2009-11-05 10:39:47

Hey Till.. Psst.. don't tell anyone.. Besides that.. You're the guy who wrote a midi-player back in the 90'th, aren't you? If I remember right we sat side by side for a weekend at a computer party in 97 or so..

Nils Pipenbrinck 2009-11-06 00:34:03

hehe, yes you're remembering correctly - those were the days - We were discussing octree quantization and stuff. It was either The Party or MekkaSymposium, dont remember exactly.

Till 2009-11-12 18:30:22

@Stephen very good advice to use ARM instead of THUMB code - even though I was aware of it, it still is important and should be noted. Thanks!

Till 2009-11-15 01:25:10

Answer 2

+2 A:

Your completely unrolled traditional product is likely pretty fast.

Your matrix is too small to overcome the overheard of managing a Strassen multiplication in its traditional form with explicit indexes and partitioning code; you'd likely lose any effect on optimization to that overhead.

But if you want fast, I'd use SIMD instructions if they are available. I'd be surprised if the ARM chips these days don't have them. If they do, you can manage all the products in row/colum in a single instruction; if the SIMD is 8 wide, you might manage 2 row/column multiplies in a single instruction. Setting the operands up to do that instruction might require some dancing around; SIMD instructions will easily pick up your rows (adjacent values), but will not pick up the columns (non-contiguous). And it may take some effort to compute the sum of the products in a row/column.

Ira Baxter 2009-11-04 14:25:05

Answer 3

+2 A:

Are you sure that your unrolled code is faster than the explicit loop based approach? Mind that the compilers are usually better than humans performing optimizations!

In fact, I'd bet there are more chances for a compiler to emit automatically SIMD instructions from a well written loop than from a series of "unrelated" statements...

You could also specify the matrices sizes in the argument declaration. Then you could use the normal bracket syntax to access the elements, and it could also be a good hint for the compiler to make its optimisations too.

fortran 2009-11-04 15:08:59

Answer 4

+8 A:

Just nitpicking. I wonder why people still obfuscate their code voluntarly? C is already difficult to read, no need to add to it.

static inline void Matrix4x4MultiplyBy4x4 (float src1[4][4], float src2[4][4], float dest[4][4])
{
dest[0][0] = src1[0][0] * src2[0][0] + src1[0][1] * src2[1][0] + src1[0][2] * src2[2][0] + src1[0][3] * src2[3][0]; 
dest[0][1] = src1[0][0] * src2[0][1] + src1[0][1] * src2[1][1] + src1[0][2] * src2[2][1] + src1[0][3] * src2[3][1]; 
dest[0][2] = src1[0][0] * src2[0][2] + src1[0][1] * src2[1][2] + src1[0][2] * src2[2][2] + src1[0][3] * src2[3][2]; 
dest[0][3] = src1[0][0] * src2[0][3] + src1[0][1] * src2[1][3] + src1[0][2] * src2[2][3] + src1[0][3] * src2[3][3]; 
dest[1][0] = src1[1][0] * src2[0][0] + src1[1][1] * src2[1][0] + src1[1][2] * src2[2][0] + src1[1][3] * src2[3][0]; 
dest[1][1] = src1[1][0] * src2[0][1] + src1[1][1] * src2[1][1] + src1[1][2] * src2[2][1] + src1[1][3] * src2[3][1]; 
dest[1][2] = src1[1][0] * src2[0][2] + src1[1][1] * src2[1][2] + src1[1][2] * src2[2][2] + src1[1][3] * src2[3][2]; 
dest[1][3] = src1[1][0] * src2[0][3] + src1[1][1] * src2[1][3] + src1[1][2] * src2[2][3] + src1[1][3] * src2[3][3]; 
dest[2][0] = src1[2][0] * src2[0][0] + src1[2][1] * src2[1][0] + src1[2][2] * src2[2][0] + src1[2][3] * src2[3][0]; 
dest[2][1] = src1[2][0] * src2[0][1] + src1[2][1] * src2[1][1] + src1[2][2] * src2[2][1] + src1[2][3] * src2[3][1]; 
dest[2][2] = src1[2][0] * src2[0][2] + src1[2][1] * src2[1][2] + src1[2][2] * src2[2][2] + src1[2][3] * src2[3][2]; 
dest[2][3] = src1[2][0] * src2[0][3] + src1[2][1] * src2[1][3] + src1[2][2] * src2[2][3] + src1[2][3] * src2[3][3]; 
dest[3][0] = src1[3][0] * src2[0][0] + src1[3][1] * src2[1][0] + src1[3][2] * src2[2][0] + src1[3][3] * src2[3][0]; 
dest[3][1] = src1[3][0] * src2[0][1] + src1[3][1] * src2[1][1] + src1[3][2] * src2[2][1] + src1[3][3] * src2[3][1]; 
dest[3][2] = src1[3][0] * src2[0][2] + src1[3][1] * src2[1][2] + src1[3][2] * src2[2][2] + src1[3][3] * src2[3][2]; 
dest[3][3] = src1[3][0] * src2[0][3] + src1[3][1] * src2[1][3] + src1[3][2] * src2[2][3] + src1[3][3] * src2[3][3]; 
};

tristopia 2009-11-04 15:35:34

You are right - there is no point in doing it as obfuscated as I did. Shame on me ;)

Till 2009-11-15 01:25:53

Answer 5

+1 A:

Are these arbitrary matrices or do they have any symmetries? If so, those symmetries can often to exploited for improved performance (for example in rotation matrices).

Also, I agree with fortran above, and would run some timing tests to verify that your hand unrolled code is faster than an optimizing compiler can create. At the least, you may be able to simplify your code.

Paul

Paul 2009-11-04 16:09:07

They are used as transformation matrices including all possible usages (rotation, scaling, translation, skewing,...).

Till 2009-11-05 10:33:55

Well rotation matrices are symmetric so there goes half of your multiplies. Translations matrices are nearly all zeros. This approach would require different routines for different transforms. But as you know, code complexity is often the cost of performance tuning.

Paul 2009-11-06 05:42:26

ansaurus

tags:

views:

answers:

Fast 4x4 Matrix Multiplication in C

related questions