I am currently working on a C++ sparse matrix/math/iterative solver library for a simulation tool I manage. I would have preferred to use an existing package; however, after extensive investigation, none were found to be appropriate for our simulator (we looked at FLENS, IT++, PETSc, Eigen, and several others). The good news is that my solvers and sparse matrix structures are now very efficient and robust. The bad news is that I am now looking into parallelization using OpenMP, and the learning curve is a bit steep.
The domain we solve can be broken into sub-domains, which come together in a block-diagonal format. So our storage scheme ends up looking like an array of smaller square matrices (blocks[]), each with a format appropriate to its sub-domain (e.g. Compressed Row Storage (CRS), Compressed Diagonal Storage (CDS), dense, etc.), plus a background matrix (currently CRS) that accounts for the connectivity between sub-domains.
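For concreteness, here is a stripped-down sketch of the storage; the class and member names here are simplified stand-ins for my actual code, not the real API:

#include <memory>
#include <vector>

// Common interface implemented by the CRS, CDS, and dense block formats.
struct SparseMatrix {
    virtual ~SparseMatrix() = default;
    virtual int numRows() const = 0;
    // y += A * x; callers pass pointers to the relevant range of each vector
    virtual void multiplyAdd(const double* x, double* y) const = 0;
};

struct BlockDiagonalSystem {
    std::vector<std::unique_ptr<SparseMatrix>> blocks;  // one square block per sub-domain
    std::unique_ptr<SparseMatrix> background;           // CRS coupling between sub-domains
};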
The "hot spot" in most (all?) iterative solvers is the Matrix Vector multiplication operation, and this is true of my library. Thus, I've been focusing my efforts on optimizing my MxV routines. For the block diagonal structure, the pseudo code for M*x=b would be as follows:
b = background_matrix * x;
start_index = 1;
for (i = 1 : number_of_blocks) {
    // each block owns a contiguous, non-overlapping range of rows
    end_index = start_index + blocks[i].numRows() - 1;
    b.range(start_index, end_index) += blocks[i] * x.range(start_index, end_index);
    start_index = end_index + 1;
}
where background_matrix is the background (CRS) matrix, blocks is the array of sub-domain matrices, and .range returns the portion of the vector from a starting index to an ending index, inclusive.
Obviously the loop can be (and has been) parallelized, as each iteration is independent of the others (the ranges are non-overlapping); a sketch of the parallel version follows below. Since we have 10-15 blocks in a typical system, 4+ threads actually make a significant difference.
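Here is roughly what that looks like in OpenMP, using the simplified types from the sketch above (again, names are illustrative). The block offsets are precomputed so each iteration is fully self-contained, and schedule(dynamic) compensates for blocks of uneven size:

#include <omp.h>
#include <vector>

// b = background_matrix * x plus the per-block products, block loop in parallel.
void blockMxV(const BlockDiagonalSystem& A, const double* x, double* b)
{
    // Precompute each block's starting offset so iterations don't depend
    // on one another (the serial version carried start_index across iterations).
    const int nblocks = static_cast<int>(A.blocks.size());
    std::vector<int> offset(nblocks + 1, 0);
    for (int i = 0; i < nblocks; ++i)
        offset[i + 1] = offset[i] + A.blocks[i]->numRows();

    A.background->multiplyAdd(x, b);  // assumes b was zeroed by the caller

    // The blocks' row ranges are disjoint, so no two iterations touch
    // the same entries of b.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < nblocks; ++i)
        A.blocks[i]->multiplyAdd(x + offset[i], b + offset[i]);
}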
The other place where parallelization has been seen to be a good option is in the MxV operation for each sub-domain storage scheme (the background_matrix product and the per-block products in the code above). There is plenty out there on parallelizing CRS, CDS, and dense matrix MxV operations. Typically a nice boost is seen with 2 threads, with greatly diminishing returns as more threads are added.
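For example, my CRS multiply is essentially the textbook row-parallel loop (simplified here to raw arrays; the real routine works on my matrix class):

// y = A * x for a CRS matrix: row_ptr has numRows+1 entries,
// col_idx and values hold the nonzeros row by row.
void crsMxV(int numRows, const int* row_ptr, const int* col_idx,
            const double* values, const double* x, double* y)
{
    #pragma omp parallel for schedule(static)
    for (int row = 0; row < numRows; ++row) {
        double sum = 0.0;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            sum += values[k] * x[col_idx[k]];
        y[row] = sum;  // each row is written by exactly one thread
    }
}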
I am envisioning a scheme where 4 threads would be used in the block loop for the above code, and each of those threads would use 2 threads for the sub-domain solves. However, I am not sure how one would manage the pool of threads using OpenMP: is it possible to limit the number of threads in an OpenMP for loop? Is this multi-level parallelism something that makes sense in practice? Any other thoughts on what I've proposed here would be appreciated (and thanks for reading all the way to the end!)
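To make the question concrete, this is roughly the shape I'm imagining, based on my reading of the omp_set_nested / num_threads documentation; it is an untested sketch, and I haven't verified that it behaves the way I hope:

#include <omp.h>
#include <vector>

// Untested sketch of the 4x2 scheme (offset computed as in blockMxV above).
void nestedBlockMxV(const BlockDiagonalSystem& A, const std::vector<int>& offset,
                    const double* x, double* b)
{
    omp_set_nested(1);  // nested parallel regions are disabled by default

    // Outer team: 4 threads spread over the blocks.
    #pragma omp parallel for num_threads(4) schedule(dynamic)
    for (int i = 0; i < static_cast<int>(A.blocks.size()); ++i)
        A.blocks[i]->multiplyAdd(x + offset[i], b + offset[i]);
    // Each block's multiplyAdd would then open its own 2-thread team,
    // e.g. the CRS row loop above with num_threads(2) on its pragma.
}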