I am currently working on a C++ sparse matrix/math/iterative solver library, for a simulation tool I manage. I would have preferred to use an existing package; however, after extensive investigation, none were found that were appropriate for our simulator (we looked at FLENS, IT++, PETSc, Eigen, and several others). The good news is my solvers and sparse matrix structures are now very efficient and robust. The bad news is, I am now looking into parallelization using OpenMP, and the learning curve is a bit steep.

The domain we solve can be broken into sub-domains, which come together in a block-diagonal format. So our storage scheme ends up looking like an array of smaller square matrices (blocks[]), each with a format appropriate to the sub-domain (e.g. Compressed Row Storage: CRS, Compressed Diagonal Storage: CDS, dense, etc.), and a background matrix (currently using CRS) that accounts for the connectivity between sub-domains.
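Roughly, the container looks something like the following (class and member names are simplified for this post, not the exact code):

#include <vector>

// Simplified sketch of the storage described above.
struct SparseCRS {                        // background connectivity matrix
    std::vector<double> values;
    std::vector<int>    colIndex;
    std::vector<int>    rowPointer;       // length numRows + 1
};

struct SubDomainBlock {                   // one square block per sub-domain
    enum Format { CRS, CDS, Dense } format;
    int numRows;
    // format-specific arrays omitted
};

struct BlockDiagonalMatrix {
    SparseCRS background;                 // coupling between sub-domains
    std::vector<SubDomainBlock> blocks;   // block-diagonal part
};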

The "hot spot" in most (all?) iterative solvers is the Matrix Vector multiplication operation, and this is true of my library. Thus, I've been focusing my efforts on optimizing my MxV routines. For the block diagonal structure, the pseudo code for M*x=b would be as follows:

b=background_matrix*x
start_index = 1;
end_index = 0;
for(i=1:number of blocks) {
    end_index = start_index + blocks[i].numRows() - 1;
    b.range(start_index, end_index) += blocks[i] * x.range(start_index, end_index);
    start_index = end_index+1;
}

where background_matrix is the background (CRS) matrix, blocks is the array of sub-domain matrices, and .range returns the portion of the vector from a starting index to an ending index.

Obviously the loop can be (and has been) parallelized, as the operations are independent of other iterations of the loop (the ranges are non-overlapping). Since we have 10-15 blocks in a typical system, 4+ threads actually makes a significant difference.
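For reference, the parallel version of the block loop currently looks roughly like the following (blockOffset here is an illustrative precomputed array of block starting rows; precomputing it removes the loop-carried start_index, so the iterations become fully independent):

// Parallel block loop: each iteration touches a disjoint range of b and x.
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < numBlocks; ++i) {
    const int start = blockOffset[i];
    const int end   = start + blocks[i].numRows() - 1;
    b.range(start, end) += blocks[i] * x.range(start, end);
}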

The other place where parallelization has been seen to be a good option is in the MxV operation for each sub-domain storage scheme (the calls on lines 1 and 6 of the code above). There is plenty out there on parallelizing CRS, CDS, and dense matrix MxV operations. Typically a nice boost is seen with 2 threads, with greatly diminishing returns as more threads are added.
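As an example, the standard row-parallel CRS MxV (sketched here with plain arrays rather than my actual classes) looks like:

// Row-parallel CRS matrix-vector product, y = A*x. Each row is
// independent, so the outer loop parallelizes cleanly; scaling is
// usually limited by memory bandwidth, hence the diminishing returns.
void crs_mxv(int numRows, const double* val, const int* colIdx,
             const int* rowPtr, const double* x, double* y)
{
    #pragma omp parallel for schedule(static)
    for (int row = 0; row < numRows; ++row) {
        double sum = 0.0;
        for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
            sum += val[k] * x[colIdx[k]];
        y[row] = sum;
    }
}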

I am envisioning a scheme where 4 threads would be used in the block loop for the above code, and each of those threads would use 2 threads for the sub-domain solves. However, I am not sure how one would manage the pool of threads using OpenMP: is it possible to limit the number of threads in an OpenMP for loop? Is this multi-level parallelism something that makes sense in practice? Any other thoughts on what I've proposed here would be appreciated (and thanks for reading all the way to the end!)

A: 

Why not ask the experts over at OpenMP.org?

Register and log in at: http://openmp.org/forum/viewforum.php?f=3

rchrd
Thanks rchrd - I actually didn't know OpenMP had a forum. I'll start perusing over there to see what I can learn.
MarkD
+3  A: 

Please note that everything I describe is implementation dependent.

Is it possible to limit the number of threads in an OpenMP for loop?

Yes. There are different ways to do so. Set omp_set_nested(1); and use something like a #pragma omp parallel for num_threads(4) directive on your outer loop and a #pragma omp parallel for num_threads(2) directive on your inner loop. This should give you 8 threads (depending on the implementation, you may also have to set OMP_THREAD_LIMIT if you have fewer than 8 cores).
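A minimal sketch of that nesting, reusing the blocks/numBlocks names from your pseudocode (the inner kernel is just a placeholder):

#include <omp.h>

omp_set_nested(1);                        // enable nested parallel regions

#pragma omp parallel for num_threads(4)   // outer team: one block per thread
for (int i = 0; i < numBlocks; ++i) {
    #pragma omp parallel for num_threads(2)   // inner team: rows of block i
    for (int row = 0; row < blocks[i].numRows(); ++row) {
        // format-specific MxV for row 'row' of block i goes here
    }
}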

Alternatively, you can manually unroll your loops, e.g. using something like

#pragma omp parallel sections
{
    #pragma omp section
    {
        // do your stuff for the first part, nest a parallel region again
    }
    #pragma omp section
    {
        // and so on for the other parts
    }
}

You can do the same thing sometimes more efficiently in OpenMP 3.0 with #pragma omp task.
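For instance, something along these lines (OpenMP 3.0 sketch, again using the names from your pseudocode):

#pragma omp parallel num_threads(8)
{
    #pragma omp single            // one thread creates the tasks
    {
        for (int i = 0; i < numBlocks; ++i) {
            #pragma omp task firstprivate(i)   // the whole team executes them
            {
                // sub-domain MxV for block i; may itself contain
                // worksharing or further tasks
            }
        }
    }   // implicit barrier: all tasks are finished here
}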

Or you start 8 threads and get the current thread number within the parallel section and schedule manually based on the thread number.
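In code, that might look roughly like this (4 pairs of threads; each pair takes every 4th block and splits its rows):

#include <omp.h>

#pragma omp parallel num_threads(8)
{
    const int tid  = omp_get_thread_num();
    const int pair = tid / 2;    // which group of blocks (0..3)
    const int lane = tid % 2;    // which half of each block's rows (0..1)

    for (int i = pair; i < numBlocks; i += 4) {
        const int rows     = blocks[i].numRows();
        const int rowBegin =  lane      * rows / 2;
        const int rowEnd   = (lane + 1) * rows / 2;
        for (int row = rowBegin; row < rowEnd; ++row) {
            // compute row 'row' of block i's MxV into the matching slice of b
        }
    }
}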

Finally, if you have a perfectly nested loop (a loop is perfectly nested if the actual assignment only happens in the innermost loop), you can rewrite everything into a single loop. Basically, pack your two iterators i and j into one big iterator (i, j). Note that this might reduce locality and therefore decrease performance.
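A toy example of that packing (OpenMP 3.0 can also do this for you with the collapse(2) clause):

// Two perfectly nested loops over an M x N iteration space,
// rewritten as a single parallel loop over k = i*N + j.
#pragma omp parallel for
for (int k = 0; k < M * N; ++k) {
    const int i = k / N;
    const int j = k % N;
    C[i][j] = A[i][j] + B[i][j];   // each (i, j) writes a distinct element
}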

Is this multi-level parallelism something that makes sense in practice?

It depends, and you have to find out for yourself. In general, multi-level parallelism makes your problem more scalable. Scheduling, however, can be more complicated. This paper might be interesting.

Regarding setting the number of threads manually: the main advantage is that you can use specific knowledge about your problem when scheduling. Thereby, you can reduce overhead and get higher locality in the code being executed, and hence more cache hits and less main-memory I/O.

The main disadvantage of manually setting the number of threads in nested parallelism is that threads in the innermost loop might wait idly at the implicit barrier while additional work could be done (example). Also, coarse-grained parallelism doesn't scale well. So if the iterations of your outer loop have very different run-times, you want to schedule more flexibly than simply splitting into 4 threads.

Any other thoughts

Have you thought about doing the MxV with SIMD? Depending on the architecture, this can give a speed-up of 2-4x. I quickly googled this presentation for you.
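For the dense case, the core of such a kernel might look like this SSE2 sketch (a real version needs a scalar tail loop for odd lengths and attention to alignment):

#include <emmintrin.h>   // SSE2 intrinsics

// Dot product of one dense matrix row with x, i.e. y[row] = dot(a, x).
// Assumes n is even; processes two doubles per iteration.
double row_dot_sse2(const double* a, const double* x, int n)
{
    __m128d acc = _mm_setzero_pd();
    for (int k = 0; k < n; k += 2) {
        __m128d va = _mm_loadu_pd(a + k);
        __m128d vx = _mm_loadu_pd(x + k);
        acc = _mm_add_pd(acc, _mm_mul_pd(va, vx));
    }
    double tmp[2];
    _mm_storeu_pd(tmp, acc);   // horizontal add of the two partial sums
    return tmp[0] + tmp[1];
}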

For MxV, loop tiling, register and cache blocking and related techniques can increase data locality and reduce other issues, e.g. false sharing. This book, chapter 11 (you can preview it), might give you some additional ideas on how to restructure data access.
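To give a flavour of what cache blocking looks like for a dense MxV (y assumed zero-initialized; the tile width is something you tune to your cache):

#include <algorithm>

// Column-blocked dense MxV: process the matrix in JB-wide strips so the
// corresponding slice of x stays in cache while we sweep over the rows.
const int JB = 256;   // tile width, tune to the target cache size
for (int j0 = 0; j0 < n; j0 += JB) {
    const int jEnd = std::min(j0 + JB, n);
    for (int i = 0; i < n; ++i)
        for (int j = j0; j < jEnd; ++j)
            y[i] += A[i][j] * x[j];
}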

stephan
stephan - thank you very much for this very informative post, and for the links; I'll definitely be reading up on the scheduling. A lot here to digest. As for using SIMD for the MxV, I have implemented an SSE MxV routine for the dense matrix multiplication, and it does have a very nice speed-up. Unfortunately, from what I've read, the memory access pattern for a sparse CRS or CDS MxV operation typically prevents vectorization. I have read a couple of papers that show methods by which to gain some benefit from SIMD for sparse MxV, but haven't had the time to really dive into it.
MarkD
@MarkD: glad it helps. I don't know much about SIMD with sparse data, I must admit. I added a link on cache blocking that might be helpful, though.
stephan