
Hello,

I'm trying to compare different methods for matrix multiplication. The first one is the normal method:

do
{
    for (j = 0; j < i; j++)
    {
        for (k = 0; k < i; k++)
        {
            suma = 0;
            for (l = 0; l < i; l++)
                suma += MatrixA[j][l]*MatrixB[l][k];

            MatrixR[j][k] = suma;
        }
    }
    c++;
} while (c < iteraciones);

The second one consists of transposing matrix B first and then doing the multiplication by rows:

int f, co;
for (f = 0; f < i; f++) {
    for (co = 0; co < i; co++) {
        MatrixB[f][co] = MatrixB[co][f];
    }
}

c = 0;
do
{
    for (j = 0; j < i; j++)
    {
        for (k = 0; k < i; k++)
        {
            suma = 0;
            for (l = 0; l < i; l++)
                suma += MatrixA[j][l]*MatrixB[k][l];

            MatrixR[j][k] = suma;
        }
    }
    c++;
} while (c < iteraciones);

The second method is supposed to be much faster, because we are accessing contiguous memory slots, but I'm not getting a significant improvement in performance. Am I doing something wrong?

I can post the complete code, but I think it is not needed.

Thanks a lot

+2  A: 

If the matrix is not large enough or you don't repeat the operations a high number of times you won't see appreciable differences.

If the matrix is, say, 1,000x1,000, you will begin to see improvements, but I would say that if it is below 100x100 you should not worry about it.

Also, any 'improvement' may be of the order of milliseconds, unless you are either working with extremely large matrices or repeating the operation thousands of times.

Finally, if you change the computer you are using for a faster one, the differences will be even narrower!

Pablo Rodriguez
Yes, I'm trying with an 800x800 matrix, but the second method is at best 1.4 times faster, which is not very significant. I think it depends on the cache memory.
Peter
800 x 800 is small; try something in the range of 8000 x 8000 as well.
High Performance Mark
@Peter I think you are correct
Andreas Brinck
A: 

If you are working with small matrices, the improvement you are mentioning is negligible. Also, performance will vary depending on the hardware you are running on. But if you are working with sizes in the millions of elements, it will have an effect. As for the program, could you paste the code you have written?

Roopesh Majeti
What I'm doing is going from a 1x1 matrix up to 1000x1000: the biggest one is multiplied once, and the smaller ones a number of times proportional to the maximum (MxMxM / NxNxN, M being the biggest dimension). So, e.g., a 1x1 matrix will be multiplied 1250000 times, a 2x2 578703 times, and so on. Then I divide the total time by the total iterations for that dimension.
Peter
A: 

Hi

Can you post some data comparing your two approaches for a range of matrix sizes? It may be that your expectations are unrealistic and that your second version is faster, but you haven't done the measurements yet.

Don't forget, when measuring execution time, to include the time to transpose MatrixB.

Something else you might want to try is comparing the performance of your code with that of the equivalent operation from your BLAS library. This may not answer your question directly, but it will give you a better idea of what you might expect from your code.
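For instance, a minimal sketch of that comparison through the CBLAS interface (header name and link flags vary by implementation; n, A, B and C are assumed to be defined elsewhere, with the matrices stored as flat row-major arrays of double):

#include <cblas.h>

/* C = A * B for n x n row-major matrices of double, stored as flat
   arrays; link against your BLAS, e.g. -lblas or -lopenblas. */
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            n, n, n,        /* M, N, K       */
            1.0, A, n,      /* alpha, A, lda */
            B, n,           /* B, ldb        */
            0.0, C, n);     /* beta, C, ldc  */

Whatever your own loops achieve, a tuned BLAS gives you a realistic upper bound to measure against.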

Regards

Mark

High Performance Mark
A: 

How big an improvement you get will depend on:

  1. The size of the cache
  2. The size of a cache line
  3. The degree of associativity of the cache

For small matrix sizes and modern processors, it's highly probable that the data from both MatrixA and MatrixB will be kept nearly entirely in the cache after you touch it the first time.
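To put rough numbers on that: assuming 8-byte double elements, a single 800x800 matrix occupies 800 * 800 * 8 ≈ 5.1 MB, so two operand matrices already overflow a typical few-megabyte cache, while a 100x100 matrix (about 80 KB) can stay resident after the first pass.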

Andreas Brinck
A: 

ATTENTION: You have a BUG in your second implementation

for (f = 0; f < i; f++) {
    for (co = 0; co < i; co++) {
        MatrixB[f][co] = MatrixB[co][f];
    }
}

When you do f=0, co=1

        MatrixB[0][1] = MatrixB[1][0];

you overwrite MatrixB[0][1] and lose that value! When the loop gets to f=1, co=0

        MatrixB[1][0] = MatrixB[0][1];

the value copied is the same that was already there.

pmg
Oh! I see. Should I use a temporary matrix? That would increase the time even more than the first method.
Peter
You can transpose the matrix with one temporary variable: `for (f=0; f<i-1; f++) for (co=f+1; co<i; co++) swap(MatrixB[f]+co, MatrixB[co]+f);`
pmg
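Spelled out, that one-liner is just the upper-triangle walk below (a sketch assuming int elements; `swap` is not a standard C function, so the temporary is written inline):

/* In-place transpose of the square i x i MatrixB: visit only the
   upper triangle and exchange each element with its mirror, so no
   value is overwritten before it has been read. */
for (f = 0; f < i - 1; f++) {
    for (co = f + 1; co < i; co++) {
        int tmp        = MatrixB[f][co];
        MatrixB[f][co] = MatrixB[co][f];
        MatrixB[co][f] = tmp;
    }
}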
+1  A: 

Look at this article, which shows you how to tune matrix multiplication and also points you to two different libraries that are even better tuned.

wheaties
A: 

Just something for you to try (but this would only make a difference for large matrices): separate out your addition logic from the multiplication logic in the inner loop like so:

for (k = 0; k < i; k++)
{
    int sums[i]; /* variable-length array: legal in C99, though not in C89 */
    for (l = 0; l < i; l++)
        sums[l] = MatrixA[j][l]*MatrixB[k][l];

    int suma = 0;
    for (int s = 0; s < i; s++)
        suma += sums[s];

    MatrixR[j][k] = suma;
}

This is because you end up stalling your pipeline when you write to suma. Granted, much of this is taken care of by register renaming and the like, but with my limited understanding of hardware, if I wanted to squeeze every ounce of performance out of the code, I would do this because now you don't have to stall the pipeline waiting for a write to suma. Since multiplication is more expensive than addition, you want to let the machine parallelize it as much as possible, so saving your stalls for the addition means you spend less time waiting in the addition loop than you would in the multiplication loop.

This is just my logic. Others with more knowledge in the area may disagree.

San Jacinto
Comment 1: If you're concerned about the instruction pipeline, consider unrolling the loops. Reducing the number of branches will speed up execution far more than separating multiplication from addition (some processors have single instructions that multiply and add).
Thomas Matthews
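As a sketch of that suggestion, here is the inner product from the answer above unrolled by four, reusing the thread's variables (the tail loop covers dimensions that are not multiples of 4):

/* Unrolled inner product: fewer branches, and four independent
   multiply-adds per iteration for the hardware to overlap. */
suma = 0;
for (l = 0; l + 3 < i; l += 4) {
    suma += MatrixA[j][l]     * MatrixB[k][l];
    suma += MatrixA[j][l + 1] * MatrixB[k][l + 1];
    suma += MatrixA[j][l + 2] * MatrixB[k][l + 2];
    suma += MatrixA[j][l + 3] * MatrixB[k][l + 3];
}
for (; l < i; l++)  /* remainder when i % 4 != 0 */
    suma += MatrixA[j][l] * MatrixB[k][l];
MatrixR[j][k] = suma;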
Comment 2: Move the `sums` array and the `suma` variable outside the scope of the `for` loop. Also consider moving the index variables `l` and `s` too. When they are inside the outer `for` loop they will be created and destroyed for every iteration of the outer `for` loop. Removing unnecessary instructions will also speed up execution.
Thomas Matthews
I was actually assuming the compiler would unroll the loops because that's pretty common, but I think it's rather unlikely that it will cache the multiplications separately for you, as it's difficult to determine the semantics of a reduction. Maybe I'm wrong and every compiler does it. As for your second comment, you are incorrect: the semantics in the code show that the memory is coming off the stack, which means it is set aside at the beginning of the function, one time.
San Jacinto
Also, in regard to your first comment, I think you are missing the point. It's not the atomicity of the multiply we are concerned about, it's the time it takes to execute a multiply. Modern processors can still take tens of times longer to do a multiply than an add. So, if you stall the pipeline until you have a value to put into suma, you may very well be waiting 40+ cycles longer on each iteration of the loop than you need to, as opposed to only stalling once or twice (if you unroll the loop, or the compiler does).
San Jacinto
A: 

For large matrices you'll want to tile the multiplication to keep stuff in the cache. I recall trying this as a test and measured a 12x performance difference, but I specifically picked a matrix size that consumed multiples of my cache.

There's a lot of literature on the subject. A starting point is:

http://en.wikipedia.org/wiki/Loop_tiling

(There was a book by an author at Intel that I recall being particularly helpful, but it's been ten years since I read it, and I can't recall the author's name.)
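For a feel of what tiling looks like, a minimal sketch (the function name, int element type, and BLOCK value are illustrative; BLOCK should be tuned so three tiles fit in the cache together):

#define BLOCK 64  /* tune so three BLOCK x BLOCK tiles fit in the cache */

/* C = A * B for n x n matrices (C99 VLA parameters). Blocking all
   three loop dimensions keeps each tile cache-resident while it is
   reused, instead of streaming whole rows and columns every time. */
void tiled_multiply(int n, int A[n][n], int B[n][n], int C[n][n])
{
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[j][k] = 0;

    for (int jj = 0; jj < n; jj += BLOCK)
        for (int ll = 0; ll < n; ll += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int j = jj; j < jj + BLOCK && j < n; j++)
                    for (int l = ll; l < ll + BLOCK && l < n; l++) {
                        int a = A[j][l];  /* reused across the whole k tile */
                        for (int k = kk; k < kk + BLOCK && k < n; k++)
                            C[j][k] += a * B[l][k];
                    }
}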

Peeter Joot
+2  A: 

What Every Programmer Should Know About Memory (PDF) by Ulrich Drepper has a lot of good ideas about memory efficiency; in particular, he uses matrix multiplication as an example of how knowing about memory, and using that knowledge, can speed up this process. Look at appendix A.1 in his paper and read through section 6.2.1. Table 6.2 in the paper shows that he could get the running time down to 10% of a naive implementation's time for a 1000x1000 matrix.

Granted, his final code is pretty hairy and uses a lot of system-specific stuff and compile-time tuning, but still, if you really need speed, reading that paper and reading his implementation is definitely worth it.

Alok