My question is about how best to structure my (C++) code to support parallelizing a time-consuming computation. The pseudocode in question has the following structure:
for a_1(y_1) in A_1
    for a_2(y_2) in A_2(a_1)
        ...
        for a_n(y_n) in A_n(a_1, ..., a_{n-1})
            y_n = f(a_1, ..., a_n)
        y_{n-1} = g_n(Y_n)
        ...
    y_1 = g_2(Y_2)
Roughly speaking, each loop iterates over elements in a set A_i, the successive elements of which depend on feedback y_i from previous iterations. In other words, to determine the next a_i, we must have finished all computations on the current a_i. Furthermore, the interior sets depend on the outer iterations. Written in recursive form:
Iterate(A_i, a_1, ..., a_{i-1}):
    for a_i(y_i) in A_i
        Y_i += Iterate(A_{i+1}, a_1, ..., a_i)
    return g_i(Y_i)

Iterate(any, a_1, ..., a_n):
    return f(a_1, ..., a_n)

Iterate(A_1)
Assume that f(...) is a time-consuming computation and that the feedback functions g(...) are simple (fast). Now, if all the sets A_i are "large", then the problem is embarrassingly parallel. Currently, I have a thread pool and just toss the computations of the inner-most loop into the pool. The problem is that very often the inner-most loop iterates over a singleton, so the thread pool only ever has one running thread in it. I have thought about using futures to return values to the outer loops, but that would require futures of futures, etc., and it gets pretty messy (I think).
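To sketch what I mean by messy: if each level handed its results to the level above through std::future without blocking, the types would have to nest one future per loop level. Purely illustrative, for the n = 3 case:

#include <future>
#include <vector>
using namespace std;

using Fut3 = future<double>;        // one pending evaluation of f
using Fut2 = future<vector<Fut3>>;  // one pending pass of the a_3 loop
using Fut1 = future<vector<Fut2>>;  // one pending pass of the a_2 loop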
I realize that the structure I have listed above is pretty complicated, so there are a number of simplifying cases I am also interested in:

1. a_i(y_i) = a_i; independent of y_i
2. A_i(a_1, ..., a_{i-1}) = A_i; independent of a_1, ..., a_{i-1}
3. g_i = 0; the feedback y_{i-1} is independent of Y_i
4. All outer loops are "large": the number of elements in those sets is much greater than the number of cores.
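For instance, when items 1-3 all hold, the nest degenerates into independent loops over fixed sets, and every evaluation of f can be dispatched up front (a sketch only, with std::async standing in for a real pool; the sets A1, A2, A3 are hypothetical stand-ins):

#include <future>
#include <vector>
using namespace std;

double f(double a1, int a2, double a3); // the slow function

// No feedback and fixed sets: flatten the whole iteration space,
// dispatch every call to f, and join once at the end.
vector<double> run_all(const vector<double> &A1, const vector<int> &A2,
                       const vector<double> &A3){
    vector<future<double>> tasks;
    for(double a1 : A1)
        for(int a2 : A2)
            for(double a3 : A3)
                tasks.push_back(async(launch::async, f, a1, a2, a3));
    vector<double> results;
    for(auto &t : tasks){ results.push_back(t.get()); } // single join point
    return results;
}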
In practice, n <= 3; item 1 holds for all the outer loops, and items 2-4 all hold, so solutions particular to that case are sufficient. But since I am bothering to ask the question here, I am interested in ideas for handling the additional complexity of more general problems, if possible.
Edit:
Cleaned up the first pseudocode block to make it consistent with the other. Since people cannot understand my mathematical notation, here is a more concrete, simple example (the h variables play the role of the y_i feedback values):
#include <cmath>
#include <iostream>
#include <vector>
using namespace std;

double f(double a1, int a2, double a3){ // Very slow function
    cout << a1 << ", " << a2 << ", " << a3 << endl;
    return pow(a1*a3, a2) + a1 + a2 + a3; // just some contrived example
}
int g2(const vector<double> &Y3){ // average-ish feedback
    double sum = 0;
    for(size_t i = 0; i < Y3.size(); ++i){ sum += Y3[i]; }
    return int(sum / (Y3.size() + 1));
}
double g1(const vector<int> &Y2){ // return 1/(min(min(Y2), 0) + 1.0)
    int minval = 0;
    for(size_t i = 0; i < Y2.size(); ++i){
        if(Y2[i] < minval){ minval = Y2[i]; }
    }
    return 1.0/(minval + 1.0);
}
int main(){
    for(double a1 = 0.0, h1 = 10.0; a1 < 1.0; a1 += h1){ // for a1 in A1
        vector<int> Y2;
        for(int a2 = 2, h2 = 1; a2 <= (int)(5*a1+10); a2 += h2){ // for a2 in A2(a1)
            vector<double> Y3;
            for(double a3 = 6.0, h3 = 1.0; a3 >= (a1+a2); a3 -= 0.5*h3){ // for a3 in A3(a1, a2)
                h3 = f(a1, a2, a3); // slow result feeds this loop's increment
                Y3.push_back(h3);
            }
            h2 = g2(Y3); // feedback into the a2 loop's increment
            Y2.push_back(h2);
        }
        h1 = g1(Y2); // feedback into the a1 loop's increment
    }
    return 0;
}
I picked the values randomly, and it turns out f is only evaluated 3 times. Note that the above code is NOT parallelizable as written, since each loop's increment depends on results computed in its body. Assume that it is possible to query whether a loop's increment depends on the loops higher up.
I should also clarify what I am after. When I originally said "structure", I perhaps should have said "parallelization methodology" or something like that. For example, my first attempt at parallelizing was to throw the inner-most calls to f into a thread pool and join at the end of the inner-most loop, as sketched below. As mentioned above, this does not work when the inner-most loop iterates over only one element. It did not require restructuring the existing code significantly, though, and I would like to keep it that way if possible.
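Roughly, that attempt looked like this (a sketch only, with std::async standing in for my real thread pool, and assuming the inner-most set is feedback-free so the calls can be issued independently):

#include <future>
#include <vector>
using namespace std;

double f(double a1, int a2, double a3); // the slow f from the example above

// Inner-most loop only: dispatch every f, join, then apply the fast g2.
int inner(double a1, int a2){
    vector<future<double>> pending;
    for(double a3 = 6.0; a3 >= a1 + a2; a3 -= 0.5){ // fixed step, no h3 feedback
        pending.push_back(async(launch::async, f, a1, a2, a3));
    }
    vector<double> Y3;
    for(auto &p : pending){ Y3.push_back(p.get()); } // join at end of the loop
    double sum = 0;
    for(size_t i = 0; i < Y3.size(); ++i){ sum += Y3[i]; }
    return int(sum / (Y3.size() + 1)); // g2 from above
}
// If A3 is a singleton, "pending" only ever holds one task, so the pool
// runs a single thread while the outer loops wait on it.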