ansaurus

Question

How does BLAS get such extreme performance?

Answer 1

+13 A:

For many reasons.

First, Fortran compilers are highly optimized, and the language allows them to be as such. C and C++ are very loose in terms of array handling (e.g. the case of pointers referring to the same memory area). This means that the compiler cannot know in advance what to do, and is forced to create generic code. In Fortran, your cases are more streamlined, and the compiler has better control of what happens, allowing him to optimize more (e.g. using registers).

Another thing is that Fortran store stuff columnwise, while C stores data row-wise. I havent' checked your code, but be careful of how you perform the product. In C you must scan row wise: this way you scan your array along contiguous memory, reducing the cache misses. Cache miss is the first source of inefficiency.

Also, you are doing a lot of operations, most of them repeated and redundant. All those multiplications to obtain the index are detrimental for the performance. I don't really know how this is done in BLAS, but there are a lot of tricks to prevent expensive operations.

For example, you could rework your code this way

template<class ValT>
void mmult(const ValT* A, int ADim1, int ADim2, const ValT* B, int BDim1, int BDim2, ValT* C)
{
if ( ADim2!=BDim1 ) throw std::runtime_error("Error sizes off");

memset((void*)C,0,sizeof(ValT)*ADim1*BDim2);
int cc2,cc1,cr1, a1,a2,a3;
for ( cc2=0 ; cc2<BDim2 ; ++cc2 ) {
    a1 = cc2*ADim2;
    a3 = cc2*BDim1
    for ( cc1=0 ; cc1<ADim2 ; ++cc1 ) {
          a2=cc1*ADim1;
          ValT b = B[a3+cc1];
          for ( cr1=0 ; cr1<ADim1 ; ++cr1 ) {
                    C[a1+cr1] += A[a2+cr1]*b;
           }
     }
  }
}

Try it, I am sure you will save something.

On you #1 question, the reason is that matrix multiplication scales as O(n^3) if you use a trivial algorithm. There are algorithms that scale much better.

Stefano Borini 2009-08-19 23:36:42

In terms of the multiplications to find indices, I tried actually taking them out and only doing them the minimum, ie. C, B the mult can be moved below the first for loop. This showed no improvement, I believe my compiler optimized it somehow... I think register optimization would definitely improve my code, however any response to #1?

DeusAduro 2009-08-19 23:42:02

did you try to crank the optimization flag to -O3 as well ?

Stefano Borini 2009-08-19 23:49:37

Ya its got all optimizations on. Thanks for the link, although that one is 'never implemented' but the Strassen algorithm is interesting.

DeusAduro 2009-08-20 00:09:15

about row vs columnwise, matrix multiplication is going to thrash the cache no matter what, since the left hand side gets traversed row-wise, and the right hand column-wise.

jalf 2009-08-20 12:03:44

@Jalf: yes. in the case of matrix multiplication this happens. I wonder, however, if pretransposing the matrix could be a good strategy.

Stefano Borini 2009-08-20 21:36:47

Answer 2

+1 A:

This is a realistic speed up. For an example of what can be done with SIMD assembler over C++ code, see some example iPhone matrix functions - these were over 8x faster than the C version, and aren't even "optimized" assembly - there's no pipe-lining yet and there is unnecessary stack operations.

Also your code is not "restrict correct" - how does the compiler know that when it modifies C, it isn't modifying A and B?

Justicle 2009-08-20 00:10:54

Sure if you called the function like mmult(A..., A..., A); you would certainly not get the expected result. Again though I wasn't trying to beat/re-implement BLAS, just seeing how fast it really is, so error checking wasn't in mind, just the basic functionality.

DeusAduro 2009-08-20 00:15:25

Sorry, to be clear, what I'm saying is that if you put "restrict" on your pointers, you'd get much faster code. This is because every time you modifiy C, the compiler doesn't have to reload A and B - dramatically speeding up the inner loop. If you don't believe me, check the disassembly.

Justicle 2009-08-20 00:34:34

@DeusAduro: This isn't error checking - it's possible that the compiler is unable to optimize the accesses to the B[] array in the inner loop because it might not be able to figure out that the A and C pointers never alias the B array. If there were aliasing it would be possible for the value in the B array to change while the inner loop is executing. Hoisting the access to the B[] value out of the inner loop and putting it in a local variable might enable the compiler to avoid continual accesses to B[].

Michael Burr 2009-08-20 00:35:06

Thanks for the clarification I guess I really didn't understand what you are saying, I'll see if I can add that and update the timings.

DeusAduro 2009-08-20 00:58:37

Hmmm, so I tried first using the '__restrict' keyword in VS 2008, applied to A, B, and C. This showed no change in the result. However moving the access to B, from the innermost loop to the loop outside improved the time by ~10%.

DeusAduro 2009-08-20 01:13:06

Sorry, I'm not sure about VC, but with GCC you need to enable `-fstrict-aliasing`. There's also better explanation of "restrict" here: http://cellperformance.beyond3d.com/articles/2006/05/demystifying-the-restrict-keyword.html

Justicle 2009-08-20 01:46:37

Answer 3

+6 A:

I don't know specfically about BLAS implementation but there are more efficient alogorithms for Matrix Multiplication that has better than O(n3) complexity. A well know one is Strassen Algorithm

Pratik 2009-08-20 00:15:17

Answer 4

+2 A:

First, there are more efficient algorithms for matrix multiplication than the one you're using.

Second, your CPU can do much more than one instruction at a time.

Your CPU executes 3-4 instructions per cycle, and if the SIMD units are used, each instruction processes 4 floats or 2 doubles. (of course this figure isn't accurate either, as the CPU can typically only process one SIMD instruction per cycle)

Third, your code is far from optimal:

You're using raw pointers, which means that the compiler has to assume they may alias. There are compiler-specific keywords or flags you can specify to tell the compiler that they don't alias. Alternatively, you should use other types than raw pointers, which take care of the problem.
You're thrashing the cache by performing a naive traversal of each row/column of the input matrices. You can use blocking to perform as much work as possible on a smaller block of the matrix, which fits in the CPU cache, before moving on to the next block.
For purely numerical tasks, Fortran is pretty much unbeatable, and C++ takes a lot of coaxing to get up to a similar speed. It can be done, and there are a few libraries demonstrating it (typically using expression templates), but it's not trivial, and it doesn't just happen.

jalf 2009-08-20 12:12:12

Thanks, I've added restrict correct code as per Justicle's suggestion, didn't see much improvement, I like the blockwise idea. Out of curiosity, without knowing the CPU's cache size how would one right optimal code?

DeusAduro 2009-08-21 06:25:07

You don't. To get optimal code, you need to know the CPU's cache size.Of course the downside to this is that you're effectively hardcoding your code for best performance on *one* family of CPU's.

jalf 2009-08-21 10:00:56

Answer 5

A:

Not an answer, but two observations ...

You can't beat BLAS, so learn how to link it, and spend your time writing code that you need.
Program matrix multiplication in a language which provides it as a library routine or intrinsic function. That way you may find that your compiler calls BLAS for you.

Regards

Mark

High Performance Mark 2009-08-20 23:32:14

Quite obviously I can link BLAS, seeing as I benchmarked it in the question. And yes I have no plans on beating it.

DeusAduro 2009-08-21 06:24:01

ansaurus

tags:

views:

answers:

How does BLAS get such extreme performance?

related questions