My question is about how best to structure my (C++) code to support parallelizing a time-consuming computation. The pseudocode in question has the following structure:

for a_1(h_1) in A_1
    for a_2(h_2) in A_2(a_1)
        ...
            for a_n(h_n) in A_n(a_1, ..., a_{n-1})
                h_n = f(a_1, ..., a_n)
            h_{n-1} = g_n(Y_n)
        ...
    h_1 = g_2(Y_2)

Roughly speaking, each loop iterates over the elements of a set A_i, where the choice of successive elements depends on the feedback h_i from previous iterations (Y_i denotes the set of h_i values collected at level i). In other words, to determine the next a_i, we must have finished all computations on the current a_i. Furthermore, the interior sets depend on the outer iterations. Written in recursive form:

Iterate(A_i, a_1, ..., a_{i-1}):
    for a_i(h_i) in A_i
        Y_i += Iterate(A_{i+1}, a_1, ..., a_i)
    return g_i(Y_i)

Iterate(any, a_1, ..., a_n):
    return f(a_1, ..., a_n)

Iterate(A_1)

Assume that f(...) is a time-consuming computation and that the feedback functions g(...) are simple (fast). Now, if all the sets A_i are "large", then the problem is embarrassingly parallel. Currently, I have a thread pool and just toss the computations of the inner-most loop into the pool. The problem is that very often the inner-most loop is an iteration over a singleton, so the thread pool only ever has one running thread in it. I have thought about using futures to return values to the outer loops, but that would require futures-of-futures, etc., and it gets pretty messy (I think).
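
For concreteness, here is a minimal sketch of that strategy, with std::async standing in for the thread pool (the dummy f body and the assumption that A3 can be enumerated up front are only for illustration):

#include <future>
#include <vector>

double f(double a1, int a2, double a3){ return a1 + a2 + a3; } // stand-in for the slow computation

std::vector<double> innermost(double a1, int a2, const std::vector<double> &A3){
    std::vector<std::future<double>> pending;
    pending.reserve(A3.size());
    for(double a3 : A3) // one task per element of A3(a1, a2)
        pending.push_back(std::async(std::launch::async, f, a1, a2, a3));
    std::vector<double> Y3;
    for(auto &p : pending) // join at the end of the inner-most loop
        Y3.push_back(p.get());
    return Y3; // the enclosing loop then applies g to Y3
}

When A3 is a singleton, pending only ever holds one task, which is exactly the degenerate case described above.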

I realize that the structure I have listed above is pretty complicated, so there are a number of simplifying cases I am also interested in:

  1. a_i(h_i) = a_i ; independent of h_i
  2. A_i(a_1, ..., a_{i-1}) = A_i ; independent of a_1, ..., a_{i-1}
  3. g_i = 0 ; independent of Y_{i+1}
  4. All outer loops are "large"; the number of elements in those sets is much greater than the number of cores.

Now, in practice n <= 3; item 1 holds for all outer loops, and items 2-4 all hold, so particular solutions for that case are sufficient. But since I am bothering to ask the question here, I am interested in ideas for dealing with the additional complexity of the more general problem, if possible.
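
For the practical case, one option is to hoist the parallelism to the outermost loop, which by items 2 and 4 is fixed and much larger than the number of cores. A rough sketch (the element type and the stand-in for the inner loops are invented):

#include <future>
#include <vector>

double inner_loops(double a1){ return a1; } // stand-in for the sequential a2/a3 loops and their g reductions

std::vector<double> parallel_outer(const std::vector<double> &A1){
    std::vector<std::future<double>> tasks;
    tasks.reserve(A1.size());
    for(double a1 : A1) // A1 is fixed (item 2), so all tasks can be issued at once
        tasks.push_back(std::async(std::launch::async, inner_loops, a1));
    std::vector<double> Y1;
    for(auto &t : tasks)
        Y1.push_back(t.get());
    return Y1;
}

Since the number of elements greatly exceeds the number of cores (item 4), the pool stays saturated even when every inner loop is a singleton.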


Edit:

Cleaned up the first pseudocode block to make it consistent with the other. Since people cannot understand my mathematical notation, here is a more concrete, simple example:

#include <cmath>
#include <iostream>
#include <vector>
using namespace std;

double f(double a1, int a2, double a3){ // Very slow function
    cout << a1 << ", " << a2 << ", " << a3 << endl;
    return pow(a1*a3, a2) + a1 + a2 + a3; // just some contrived example
}

int g2(const vector<double> &Y3){ // average-ish
    double sum = 0;
    for(size_t i = 0; i < Y3.size(); ++i){ sum += Y3[i]; }
    return int(sum / (Y3.size() + 1));
}

double g1(const vector<int> &Y2){ // 1/(min(0, Y2[1..]) + 1.0); contrived
    int minval = 0; // starts from 0 and skips Y2[0]
    for(size_t i = 1; i < Y2.size(); ++i){
        if(Y2[i] < minval){ minval = Y2[i]; }
    }
    return 1.0/double(minval + 1.0);
}

int main(){
    for(double a1 = 0.0, h1 = 10.0; a1 < 1.0; a1 += h1){ // for a1 in A1
     vector<int> Y2;
     for(int a2 = 2, h2 = 1; a2 <= (int)(5*a1+10); a2 += h2){ // for a2 in A2(a1)
      vector<double> Y3;
      for(double a3 = 6.0, h3 = 1.0; a3 >= (a1+a2); a3 -= 0.5*h3){ // for a3 in A3(a1, a2)
       h3 = f(a1, a2, a3);
       Y3.push_back(h3);
      }
      h2 = g2(Y3);
      Y2.push_back(h2);
     }
     h1 = g1(Y2);
    }

    return 0;
}

I picked the values randomly, and it turns out f is only evaluated 3 times. Note that the above code is NOT parallelizable as written, since every loop's increment depends on feedback. Assume that it is possible to query whether a loop's incrementation depends on the loops above it.

I should also clarify what I'm after. When I originally said "structure", I perhaps should have said "parallelization methodology" or something like that. For example, my first attempt at parallelizing was to throw the inner-most calls to f into a thread pool and join at the end of the inner-most loop. As mentioned above, this doesn't work when the inner-most loop iterates over only one element. This didn't require restructuring the existing code significantly, and I would like to avoid significant restructuring if possible.
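
Given the ability to query whether a level's iteration depends on feedback, one methodology that avoids special-casing by hand is to build the test into the recursive Iterate itself: map a level onto tasks when its elements are independent, and recurse sequentially when they are not. A sketch under that assumption (the Level struct and the toy f and g are invented stand-ins):

#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Invented stand-ins: f is the slow leaf computation, g the fast reduction.
double f(const std::vector<double> &args){
    return std::accumulate(args.begin(), args.end(), 0.0);
}
double g(const std::vector<double> &Y){
    return Y.empty() ? 0.0 : Y.front();
}

struct Level {
    bool independent;             // queryable: does iteration ignore feedback?
    std::vector<double> elements; // enumerable up front only when independent
};

double iterate(const std::vector<Level> &levels, std::size_t i,
               std::vector<double> args){
    if(i == levels.size())
        return f(args); // inner-most: the expensive call
    std::vector<double> Y;
    if(levels[i].independent){ // no feedback: map this level onto tasks
        std::vector<std::future<double>> tasks;
        for(double a : levels[i].elements){
            args.push_back(a);
            tasks.push_back(std::async(std::launch::async, iterate,
                                       std::cref(levels), i + 1, args));
            args.pop_back();
        }
        for(auto &t : tasks)
            Y.push_back(t.get());
    }else{ // feedback present: this level must stay sequential
        for(double a : levels[i].elements){ // real code would derive the next
            args.push_back(a);              // element from the returned value
            Y.push_back(iterate(levels, i + 1, args));
            args.pop_back();
        }
    }
    return g(Y); // fast reduction; its value feeds back / returns upward
}

Called as iterate(levels, 0, {}), this parallelizes at every independent level at once; a singleton inner-most set then costs nothing, because the parallelism moves to whichever enclosing level is large and independent.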

A: 

To be honest, your notation is hard to grok at first glance (for me, at least). Perhaps you could be more explicit, or express it in C or C++ code. What is your method of parallelization (pthreads, OpenMP, etc.)? I suspect one issue is that you could improve your load balancing. For instance, you may not want to assign work to threads in a card-dealing fashion.

BobbyShaftoe
I currently use pthreads. With regard to the notation, yes, I know. The problem is that all the sets contain objects of different types, so writing it out in a C-like manner would require a lot of extra details. The += in Iterate should really mean "add to set".
Victor Liu
A: 

I'm also having a hard time reading this. If you could express it in terms of sequential C++ code or STL algorithms, we could likely help here.

Rick
A: 

If at all possible, the best way to speed up a deeply nested set of calls like this is to not have a deeply nested set of calls.

You can often re-organize your data, or the links inside it, so that references save you a level of looping; sometimes you can instead line up the loops one after the other, storing the intermediate information. Sometimes it even takes creating a different object structure.
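
As a sketch of the "line up the loops" idea under the simplest assumptions (enumerable sets, no g reductions; everything below is invented for illustration), the nesting can be flattened into one work list whose items are then trivially independent:

#include <vector>

struct Work { double a1; int a2; double a3; }; // arguments of one leaf computation

std::vector<Work> flatten(){
    std::vector<Work> items;
    // Example sets; the real ones come from the problem.
    for(double a1 = 0.0; a1 < 4.0; a1 += 1.0)                // A1
        for(int a2 = 0; a2 < 3; ++a2)                        // A2(a1)
            for(double a3 = a1 + a2; a3 < 10.0; a3 += 2.0)   // A3(a1, a2)
                items.push_back({a1, a2, a3});
    return items; // a single flat loop over items parallelizes well,
}                 // no matter how small any individual level is

The catch is the stored intermediate information: with feedback (non-zero g_i), the list cannot be built up front, so this fits the simplified cases rather than the general problem.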

I'm not saying this always works, but removing even one level gives a MUCH more significant reduction in time than anything else you might try.

If I could understand your pseudocode I might give it a try, but I'm guessing you've abstracted out most of the structure that would be required for an insightful design anyway.

Bill K
+2  A: 

You could try to express your problem in the form of a map-reduce problem (http://en.wikipedia.org/wiki/MapReduce), making each level of nesting a single map-reduce job. The for loop would be translated into the mapping, and the g_i call into the reduction step.
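
For a single nesting level, that might look roughly like the following (a sketch only; std::async and the function-pointer signatures are illustrative choices, not a real map-reduce framework):

#include <future>
#include <vector>

double map_reduce_level(const std::vector<double> &A,
                        double (*f)(double),                       // the slow map
                        double (*g)(const std::vector<double> &)){ // the fast reduce
    std::vector<std::future<double>> mapped;
    mapped.reserve(A.size());
    for(double a : A) // map: one independent task per element
        mapped.push_back(std::async(std::launch::async, f, a));
    std::vector<double> Y;
    for(auto &m : mapped)
        Y.push_back(m.get());
    return g(Y); // reduce: the g_i step
}

Nesting then amounts to making f itself another map_reduce_level one level further in.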

You could try to make your pseudolanguage a bit clearer... maybe express it as a Python program with n=3 or n=4? Is your "for" a regular for loop? If so, I don't really understand the first pair of parentheses.

I'm not really sure your problem is parallelizable in the stated form. If the loop's variable depends on the previous iteration, then it looks more like a sequential problem to me.

liori
This is exactly what I'd do. If you can arrange it so that each step can be done using message semantics, it'll make your life easier.
kyoryu
You're right that it would not be parallelizable if the loop's variable were strictly dependent on the previous iteration. But sometimes it's only dependent on values from 3 iterations ago, or not at all. Supposing you were able to query whether such a dependency exists, is it possible to do better than plain sequential execution? The thing is, I don't want to special-case whether each for-loop is over a singleton (or a small set) or not.
Victor Liu
Try Haskell's metacompiler. It might deduce the data dependencies between iterations and parallelize the code without hints, regardless of which special case your code represents. Then you (probably) would not have to explicitly write parallel code for each case from scratch.
liori