views:

1408

answers:

10

I need a way to represent a 2-D array (a dense matrix) of doubles in C++, with absolute minimum accessing overhead.

I've done some timing on various linux/unix machines and gcc versions. An STL vector of vectors, declared as:

vector<vector<double> > matrix(n,vector<double>(n));

and accessed through matrix[i][j] is between 5% and 100% slower to access than an array declared as:

double *matrix = new double[n*n];

accessed through an inlined index function matrix[index(i,j)], where index(i,j) evaluates to i+n*j. Other ways of arranging a 2-D array without STL - an array of n pointers to the start of each row, or defining the whole thing on the stack as a constant size matrix[n][n] - run at almost exactly the same speed as the index function method.

Recent GCC versions (> 4.0) seem to be able to compile the STL vector-of-vectors to nearly the same efficiency as the non-STL code when optimisations are turned on, but this is somewhat machine-dependent.

I'd like to use STL if possible, but will have to choose the fastest solution. Does anyone have any experience in optimising STL with GCC?

+8  A: 

My guess would be that, for a matrix, the fastest option is to use a 1-D STL array and overload the () operator so it can be used as a 2-D matrix.

However, the STL also defines a type specifically for non-resizeable numerical arrays: valarray. It also provides various optimisations for in-place operations.

valarray takes a numerical type as its template argument:

valarray<double> a;

Then you can use slices, indirect arrays, and so on, and of course you can inherit from valarray and define your own operator()(int i, int j) for 2-D arrays ...
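A minimal sketch of that idea, wrapping a std::valarray<double> behind an operator()(i, j) (composition rather than inheritance; the class and member names are invented for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <valarray>

// Minimal 2-D wrapper over std::valarray, row-major layout.
// Slices make extracting a column as easy as extracting a row.
class Matrix2D {
    std::size_t rows_, cols_;
    std::valarray<double> data_;
public:
    Matrix2D(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(0.0, rows * cols) {}

    double& operator()(std::size_t i, std::size_t j) { return data_[i * cols_ + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data_[i * cols_ + j]; }

    // A row is a contiguous slice; a column is a strided slice.
    std::valarray<double> row(std::size_t i) const {
        return data_[std::slice(i * cols_, cols_, 1)];
    }
    std::valarray<double> col(std::size_t j) const {
        return data_[std::slice(j, rows_, cols_)];
    }
};
```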

PierreBdR
My upvote is for valarray, not necessarily to make a custom matrix type. Well, custom matrix type could work, but should still be based off valarray instead of vector (valarray supports slicing, which makes getting a column just as easy as getting a row).
Chris Jester-Young
Careful inheriting from std::valarray; it is not designed for inheritance, as most of the "STL".
Patrick Johnmeyer
You can inherit from any class of the STL as long as you don't add data members to it. There is no problem adding methods, though.
PierreBdR
You can inherit "not-designed-for-inheritance" classes if you're using _private_ inheritance, if I understand correctly. e.g., class matrix : private std::valarray<double> or the like. Essentially you're not relying on any virtual behaviour.
Chris Jester-Young
Public or protected inheritance of "not-designed-for-inheritance" classes is a no-no, however.
Chris Jester-Young
I disagree with the last statement. The problem with "not-designed-for-inheritance" classes is that the destructor is not virtual => the derived destructor might not be called. So any resource allocated by the derived class won't be freed => you should not add any data members.
PierreBdR
Note that even if you don't add member data, deleting a derived object via pointer to base type is undefined behavior if the destructor is nonvirtual.
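The private-inheritance suggestion from the comments above might look like this sketch (names invented for illustration). Because the inheritance is private, client code cannot hold a std::valarray<double>* to a matrix object, so the non-virtual base destructor is never invoked through a base pointer:

```cpp
#include <cassert>
#include <cstddef>
#include <valarray>

// Private inheritance from std::valarray, per the comment thread above.
// No data members are added; selected base facilities are re-exported.
class matrix : private std::valarray<double> {
    std::size_t n_;
public:
    explicit matrix(std::size_t n) : std::valarray<double>(0.0, n * n), n_(n) {}

    double& operator()(std::size_t i, std::size_t j) {
        return (*this)[i * n_ + j];  // row-major indexing into the base storage
    }

    using std::valarray<double>::sum;  // expose std::valarray::sum()
};
```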
+5  A: 

My recommendation would be to use Boost.UBLAS, which provides fast matrix/vector classes.

Martin Cote
I should have clarified that while I'm dealing with matrices, the operations I'm performing aren't typical linear algebra. UBLAS looks very good for linear algebra, but perhaps overkill if I'm only using it as 2D array storage.
Chris Johnson
I have tried various linear algebra libraries for use with 2-D data (maps), but they are neither convenient to use for non-linear-algebra purposes nor faster than a vector of vectors. UBLAS (and others) are only fast for multiplication and other 'typical' matrix usages, not so much for element access.
Roel
+5  A: 

Very likely this is a locality-of-reference issue. vector uses new to allocate its internal array, so each row will be at least a little apart in memory due to each block's header; it could be a long distance apart if memory is already fragmented when you allocate them. Different rows of the array are likely to at least incur a cache-line fault and could incur a page fault; if you're really unlucky two adjacent rows could be on memory lines that share a TLB slot and accessing one will evict the other.

In contrast your other solutions guarantee that all the data is adjacent. It could help your performance if you align the structure so it crosses as few cache lines as possible.

vector is designed for resizable arrays. If you don't need to resize the arrays, use a regular C++ array. STL operations can generally operate on C++ arrays.

Do be sure to walk the array in the correct direction, i.e. across (consecutive memory addresses) rather than down. This will reduce cache faults.
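The traversal-order point can be demonstrated with a sketch (row-major layout assumed; both loops visit every element, but only the first walks consecutive addresses):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major n x n matrix stored flat: element (i, j) lives at i*n + j.

// Cache-friendly: the inner loop makes stride-1 accesses.
double sum_row_order(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += m[i * n + j];
    return s;
}

// Cache-hostile: the inner loop makes stride-n accesses.
double sum_col_order(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s += m[i * n + j];
    return s;
}
```

Both functions compute the same result; only their memory-access pattern differs, which is where the 4x slowdown the questioner measured comes from.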

Mike Dimmick
I hadn't thought about the block headers in the vector solution. I knew about the potential slowdown from walking the wrong way though: my speed tests show that walking the wrong way can be four times slower than doing it the right way!
Chris Johnson
+7  A: 

If you're using GCC, the compiler can analyze your matrix accesses and change the memory layout in certain cases. The magic compiler flag is defined as:

-fipa-matrix-reorg

Perform matrix flattening and transposing. Matrix flattening tries to replace an m-dimensional matrix with its equivalent n-dimensional matrix, where n < m. This reduces the level of indirection needed for accessing the elements of the matrix. The second optimization is matrix transposing, which attempts to change the order of the matrix's dimensions in order to improve cache locality. Both optimizations require the -fwhole-program flag. Transposing is enabled only if profiling information is available.

Note that this option is not enabled by -O2 or -O3. You have to pass it yourself.

Nils Pipenbrinck
Does this really work with std::vector? I doubt it.
lothar
Would be both amazing and scary indeed.
peterchen
+1  A: 

To be fair, it depends on the algorithms you are using on the matrix.

The double name[n*m] format is very fast when you are accessing data by rows, both because it has almost no overhead besides a multiplication and an addition, and because your rows are packed data that will be contiguous in cache.

If your algorithms access column-ordered data, then other layouts might have much better cache coherence. If your algorithm accesses data in quadrants of the matrix, still other layouts might be better.

Try to do some research into the type of usage and algorithms you are working with. That is especially important if the matrices are very large, since cache misses may hurt your performance far more than the 1 or 2 extra arithmetic operations needed to compute each address.

OldMan
+1  A: 

You could just as easily do vector< double >( n*m );

tfinniga
A: 

There is the uBLAS implementation in Boost. It is worth a look.

http://www.boost.org/doc/libs/1_36_0/libs/numeric/ublas/doc/matrix.htm

ceretullis
+1  A: 

You may want to look at the Eigen C++ template library at http://eigen.tuxfamily.org/ . It generates AltiVec or SSE2 code to optimize the vector/matrix calculations.

lothar
A: 

Another related library is Blitz++: http://www.oonumerics.org/blitz/docs/blitz.html

Blitz++ is designed to optimize array manipulation.

A: 

I have done this some time back for raw images by declaring my own 2 dimensional array classes.

In a normal 2D array, you access the elements like:

array[2][3]. Now to get that effect, you'd have an array class with an overloaded [] accessor. But this would essentially return another array, thereby giving you the second dimension.

The problem with this approach is that it has a double function call overhead.

The way I did it was to use the () style overload.

So instead of array[2][3], I had it use the style array(2,3).

That () function was very tiny and I made sure it was inlined.

See this link for the general concept of that: http://www.learncpp.com/cpp-tutorial/99-overloading-the-parenthesis-operator/

You can template the type if you need to.
You can template the type if you need to.
The difference in my case was that my array was dynamic: I allocated a block of char memory, and I employed a cache of row offsets so I knew where in my sequence of bytes each row began. Access was optimized for neighbouring values, because I was using it for image processing.

It's hard to explain without the code but essentially the result was as fast as C, and much easier to understand and use.
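The description above (a raw byte buffer plus precomputed row offsets, accessed through an inlined operator()) might look roughly like this sketch; all names are invented, since the original code isn't shown:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of a dynamic 2-D image buffer with precomputed row offsets,
// loosely following the answer's description. Element type is templated.
template <typename T>
class Image2D {
    std::size_t width_, height_;
    std::vector<T> bytes_;               // backing storage, one allocation
    std::vector<std::size_t> rowStart_;  // cached byte offset of each row
public:
    Image2D(std::size_t width, std::size_t height)
        : width_(width), height_(height),
          bytes_(width * height), rowStart_(height) {
        for (std::size_t y = 0; y < height; ++y)
            rowStart_[y] = y * width;    // precompute where each row begins
    }

    // Inlined accessor: img(x, y) style instead of img[y][x].
    inline T& operator()(std::size_t x, std::size_t y) {
        return bytes_[rowStart_[y] + x];
    }
};
```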

Matt H