views:

574

answers:

7

Hello. I'm trying to find some matrix multiplication/inversion benchmarks online. My C++ implementation can currently invert a 100 x 100 matrix in 38 seconds, but compared to this benchmark I found, my implementation's performance really sucks. I don't know if it's some super-optimized code or if you really can easily invert a 200 x 200 matrix in about 0.11 seconds, so I'm looking for more benchmarks to compare results against. Do you have some good links?

UPDATE I spotted a bug in my multiplication code that didn't affect the result but wasted cycles uselessly. My inversion now executes in 20 seconds. That's still a lot of time, and any ideas are welcome.

Thank you, folks.

A: 

The LINPACK benchmarks are based on solving linear algebra problems. They're available for different machines and languages. Maybe they can help you, too.

LINPACK C++ libraries are available here, too: http://people.sc.fsu.edu/~burkardt/cpp_src/linpack/linpack.html

duffymo
A: 

I actually gained about 7 seconds using doubles instead of long doubles, but that's not a great deal since I lost half of my precision.

tunnuz
Are you sure about that? In my experience you don't get nearly twice the precision with long double. See http://stackoverflow.com/questions/476212/what-is-the-precision-of-long-double-in-c for details.
Bill the Lizard
Which says nothing about how much precision you end up with at the end of a lot of computations. In a given computation-intensive application, he might be losing half his precision.
David Thornley
No, actually it was an estimate based on the number of bits per number.
tunnuz
That's odd; the FPU should do all its computation in 80 bits, which is the size of a long double (unless you are compiling for x64).
Ismael
+1  A: 

I don't know if it's some super-optimized code or if you really can easily invert a 200 x 200 matrix in about 0.11 seconds

MATLAB does that without breaking a sweat either. Are you implementing the LAPACK routines for matrix inversion (e.g. LU decomposition)?

Zach Scrivena
I'm doing it with LUP factorization.
tunnuz
@tunnuz: Another reason for the speed discrepancy is that many of the optimized implementations are written with particular instruction sets in mind (e.g. Intel's Math Kernel Library). Many are also tweaked at the assembly level for improved performance.
Zach Scrivena
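(For reference, an in-place LUP factorization of the kind discussed above can be sketched as follows; this is untuned, and the 1e-12 singularity threshold is an arbitrary choice:)

```cpp
#include <cmath>
#include <utility>
#include <vector>

// In-place LU decomposition with partial pivoting (Doolittle form).
// On return, `a` holds L (unit diagonal, strictly below) and U (on and
// above the diagonal), and `perm` records the row swaps.
// Returns false if the matrix is numerically singular.
bool lup_decompose(std::vector<double>& a, std::vector<int>& perm, int n) {
    perm.resize(n);
    for (int i = 0; i < n; ++i) perm[i] = i;
    for (int k = 0; k < n; ++k) {
        // Partial pivoting: pick the row with the largest |a[i][k]|, i >= k.
        int pivot = k;
        for (int i = k + 1; i < n; ++i)
            if (std::fabs(a[i * n + k]) > std::fabs(a[pivot * n + k]))
                pivot = i;
        if (std::fabs(a[pivot * n + k]) < 1e-12) return false;
        if (pivot != k) {
            std::swap(perm[pivot], perm[k]);
            for (int j = 0; j < n; ++j)
                std::swap(a[pivot * n + j], a[k * n + j]);
        }
        // Eliminate entries below the pivot.
        for (int i = k + 1; i < n; ++i) {
            a[i * n + k] /= a[k * n + k];
            for (int j = k + 1; j < n; ++j)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
    }
    return true;
}
```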
+3  A: 

Check whether you are passing huge matrix objects by value (this can be costly, since it copies the whole matrix).
If possible, pass by reference.

The thing about matrices and C++ is that you want to avoid copying as much as possible.
So your main object should probably not contain the "matrix data" but rather meta data about the matrix and a pointer (wrapped in something smart) to the data portion. Then, when copying an object, you only copy a small chunk of data, not the whole thing (see string implementations for an example).

Martin York
I'm doing exactly as you explained, my matrices contain only meta data (rows, columns, determinant) and shared pointers to the content.
tunnuz
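(A minimal sketch of that layout using std::shared_ptr; note that plain sharing gives aliasing rather than copy-on-write, so a write through one copy is visible through the others:)

```cpp
#include <memory>
#include <vector>

// Cheap-to-copy matrix: copies share the element buffer, so passing
// by value only duplicates the small header, not the n*n payload.
class Matrix {
public:
    Matrix(int rows, int cols)
        : rows_(rows), cols_(cols),
          data_(std::make_shared<std::vector<double>>(rows * cols, 0.0)) {}

    double& at(int r, int c) { return (*data_)[r * cols_ + c]; }
    double  at(int r, int c) const { return (*data_)[r * cols_ + c]; }
    int rows() const { return rows_; }
    int cols() const { return cols_; }

private:
    int rows_, cols_;
    std::shared_ptr<std::vector<double>> data_;  // shared, never deep-copied
};
```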
+2  A: 

Why do you need to implement your own matrix library in the first place? As you've already discovered, there are already extremely efficient libraries available doing the same thing. And as much as people like to think of C++ as a performance language, that's only true if you're really good at the language. It is extremely easy to write terribly slow code in C++.

jalf
You seem to limit your comments to C++. It seems to me that it is equally easy to write slow code in any language.
Jon Trauntvein
Yes and no. It's certainly possible to write slow code in any language, but in C++, it's the default, if you don't know exactly what you're doing. As an example, C# does not invoke dozens of unneeded copy operations, and it does not punish you for calling new(). Those can be very expensive in C++.
jalf
There are many seemingly harmless things you can do in C++ that nevertheless ruin performance. With something like Python or Java, performance is much more straightforward and predictable, even for non-experts.
jalf
+2  A: 

This sort of operation is extremely cache sensitive. You want to be doing most of your work on variables that are in your L1 & L2 cache. Check out section 6 of this doc:

http://people.redhat.com/drepper/cpumemory.pdf

He walks you through optimizing a matrix multiply in a cache-optimized way and gets some big perf improvements.

twk
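(The technique in that document can be sketched as a blocked, or tiled, multiply; the block size B below is just a starting point to tune against your L1 size, not a recommendation:)

```cpp
#include <algorithm>
#include <vector>

// Blocked (tiled) n x n multiply: c = a * b, row-major storage.
// Working on B x B sub-blocks keeps the active data in cache.
void multiply_blocked(const std::vector<double>& a,
                      const std::vector<double>& b,
                      std::vector<double>& c, int n, int B = 32) {
    std::fill(c.begin(), c.end(), 0.0);
    for (int ii = 0; ii < n; ii += B)
        for (int kk = 0; kk < n; kk += B)
            for (int jj = 0; jj < n; jj += B)
                for (int i = ii; i < std::min(ii + B, n); ++i)
                    for (int k = kk; k < std::min(kk + B, n); ++k) {
                        double aik = a[i * n + k];
                        // Innermost loop walks b and c sequentially.
                        for (int j = jj; j < std::min(jj + B, n); ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```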
+1  A: 

Have you tried profiling it?

According to this paper (pdf), the calculation for a 100x100 matrix with LU decomposition needs 1,348,250 floating-point operations. A Core 2 can do around 20 GFLOPS (processor metrics), so theoretically speaking you could do an inversion in about 1 ms.

Without the code it's pretty difficult to say what causes the large gap. From my experience, micro-optimizations like loop unrolling, caching values, SSE, threading, etc. will only get you a speedup that is at best a constant factor of your current time (which may be enough for you).

But if you want an order-of-magnitude speed increase, you should take a look at your algorithm; perhaps your implementation of LU decomposition has a bug. Another place to look is the organization of your data: try different layouts, e.g. putting row/column elements together in memory.

Ismael
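(To illustrate the data-organization point, here are two common layouts for the same n x n matrix; the function names are just for illustration:)

```cpp
#include <vector>

// Nested vectors: each row is a separate heap allocation, so rows are
// scattered through memory and row-to-row scans miss the cache.
std::vector<std::vector<double>> nested(int n) {
    return std::vector<std::vector<double>>(n, std::vector<double>(n, 0.0));
}

// Flat storage: all n*n elements are contiguous, which is much
// friendlier to the cache and the hardware prefetcher.
std::vector<double> flat(int n) {
    return std::vector<double>(n * n, 0.0);  // element (i, j) lives at i * n + j
}
```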