I have implemented an iterative algorithm, where each iteration involves a pre-order tree traversal (sometimes called downwards accumulation) followed by a post-order tree traversal (upwards accumulation). Each visit to each node involves calculating and storing information to be used for the next visit (either in the subsequent post-order traversal, or the subsequent iteration).

During the pre-order traversal, each node can be processed independently as long as all nodes between it and the root have already been processed. After processing, each node passes a tuple (specifically, two floats) to each of its children. During the post-order traversal, each node can be processed independently as long as all of its subtrees (if any) have already been processed. After processing, each node passes a single float up to its parent.

The structure of the trees is static and unchanged during the algorithm. However, during the downward traversal, if the two floats being passed both become zero, the entire subtree under that node does not need to be processed, and the upward traversal for that node can begin immediately. (The subtree must be preserved, because the floats passed on subsequent iterations may become non-zero at this node, at which point traversal would resume.)

The intensity of computation at each node is the same across the tree, and the computation itself is trivial: just a few sums and multiplies/divides on a list of numbers with length equal to the number of children at the node.
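
Schematically, one iteration of my serial implementation looks like the following (heavily simplified; Node, f, g, and combine here are placeholders for my real tree type and arithmetic, not actual code):

    class Node:
        def __init__(self, children=()):
            self.children = list(children)
            self.a = self.b = 0.0   # downward state stored for the next visit
            self.up = 0.0           # upward result stored for the next visit

    def f(child, a):                # placeholder for the real downward arithmetic
        return a * 0.5

    def g(child, b):                # placeholder
        return b * 0.5

    def combine(node, child_values):    # placeholder: a few sums and multiplies
        return sum(child_values, 1.0)

    def down(node, a, b):           # pre-order / downward accumulation
        node.a, node.b = a, b
        if a == 0.0 and b == 0.0:
            return                  # both zero: skip the whole subtree this iteration
        for child in node.children:
            down(child, f(child, a), g(child, b))

    def up(node):                   # post-order / upward accumulation
        if not (node.a == 0.0 and node.b == 0.0):
            node.up = combine(node, [up(c) for c in node.children])
        return node.up              # a pruned node just re-reports its stored value

    def iterate(root, a, b):        # one full iteration
        down(root, a, b)
        return up(root)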

The trees being processed are unbalanced: a typical node would have 2 leaves plus 0-6 additional child nodes, so simply partitioning the tree into a set of relatively balanced subtrees is non-obvious (to me). Further, the trees are designed to consume all available RAM: the bigger the tree I can process, the better.

My serial implementation attains on the order of 1000 iterations per second on just my little test trees; with the "real" trees, I expect it might slow by an order of magnitude (or more?). Given that the algorithm requires at least 100 million iterations (possibly up to a billion) to reach an acceptable result, I'd like to parallelize the algorithm to take advantage of multiple cores. I have zero experience with parallel programming.

What is the recommended pattern for parallelization given the nature of my algorithm?

+1  A: 

The usual method is to use some kind of depth-first work-splitting. You start with a number of workers waiting on an idle queue and one worker beginning a traversal at the root. A worker with work traverses depth-first, and whenever it is at a node with more than one child left to be done, it checks the idle worker queue; if that queue is not empty, it farms off a subtree (a child) to another worker. There is some complication in handling the join when a worker finishes a subtree, but in general this can work well for a variety of tree structures, balanced or unbalanced.
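
Here is a rough Python sketch of that pattern, using a semaphore as the idle-worker queue (Node and process are placeholder names; note that under CPython's GIL this mainly illustrates the control flow, and real speedup would need a process pool or a runtime without a global lock):

    import threading
    from concurrent.futures import ThreadPoolExecutor

    class Node:
        def __init__(self, children=()):
            self.children = list(children)

    def process(node):                    # placeholder per-node computation
        pass

    class Splitter:
        def __init__(self, max_workers=4):
            self.pool = ThreadPoolExecutor(max_workers=max_workers)
            self.idle = threading.Semaphore(max_workers)  # the idle-worker queue

        def walk(self, node):
            process(node)                 # visit this node first (pre-order)
            futures, rest = [], list(node.children)
            # While more than one child is left and an idle worker exists,
            # farm a whole child subtree off to that worker.
            while len(rest) > 1 and self.idle.acquire(blocking=False):
                futures.append(self.pool.submit(self._stolen, rest.pop()))
            for child in rest:            # traverse the remainder ourselves
                self.walk(child)
            for fut in futures:           # the join step for farmed-off subtrees
                fut.result()

        def _stolen(self, subtree_root):
            try:
                self.walk(subtree_root)
            finally:
                self.idle.release()       # this worker rejoins the idle queue

The semaphore plays the role of the idle queue: a permit is available exactly when a pool worker is free to take a stolen subtree.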

Chris Dodd
The usual method demands that every programmer asking for parallel speedup work directly with two of the hardest-to-use constructs in the catalog, threads and shared data structures. It would also require jumping through excessive hoops as soon as the programmer wants to extend the parallelism beyond a single shared-memory machine.
Novelocrat
+1  A: 

Try to rewrite your algorithm to be composed of pure functions. That means every piece of code is essentially a small, self-contained function with no dependence on global or static variables, all data is treated as immutable (changes are only made to copies), and every function only "manipulates" state (in a loose sense of the word) by returning new data.

If every function is referentially transparent (it depends only on its input, with no hidden state, and the same input always yields the same output), then you are in a good position to parallelize the algorithm. Since your code never mutates global variables (or files, servers, etc.), the work a function does can be safely repeated (to recompute its result) or skipped entirely (no future code depends on its side effects, so dropping a call breaks nothing). When you run your suite of functions (for example, on some implementation of MapReduce, Hadoop, etc.), the chain of functions forms a cascade of dependencies based solely on the output of one function feeding the input of another, and WHAT you are trying to compute (via pure functions) is completely separate from the ORDER in which it is computed (a question answered by the scheduler of a framework like MapReduce).
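
As a small illustration in Python (all names here are invented for the example), each pass is a function from old data to new data, touching nothing else, so both properties above hold by construction:

    from typing import NamedTuple, Tuple

    class Node(NamedTuple):           # immutable tree: "changes" produce copies
        children: Tuple["Node", ...] = ()

    def split(a, b, n):               # placeholder: the (a, b) sent to each child
        return [(a / n, b / n)] * n if n else []

    def down(node, a, b):
        # Pre-order pass: a pure function, so the same (node, a, b)
        # always yields the same new tree.
        if a == 0.0 and b == 0.0:
            return node               # a pruned subtree passes through unchanged
        pairs = split(a, b, len(node.children))
        return node._replace(children=tuple(
            down(c, ca, cb) for c, (ca, cb) in zip(node.children, pairs)))

    def up(node):
        # Post-order pass: referentially transparent, so calls on disjoint
        # subtrees can run in any order, in parallel, or be memoized.
        return 1.0 + sum(up(c) for c in node.children)  # placeholder combine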

A great place to learn this mode of thinking is to write your algorithm in the programming language Haskell (or something like F# or OCaml), which has great support for parallel/multicore programming out of the box. Haskell forces your code to be pure, so if your algorithm works, it IS probably easily parallelizable.

Jared Updike
Some interesting ideas here. Haskell looks intriguing. If I spend the time learning it to port the algorithm there, will I also have to learn MapReduce/Hadoop, or are these disjoint approaches?
travis
Recoding in Haskell would be separate from recoding in a MapReduce/Hadoop environment. The choice really depends on the scale of the trees you're trying to work with.
Novelocrat
Haskell is a little mind-bending to learn (in a good way). One of the best books was recently published by O'Reilly and is available online for free (http://book.realworldhaskell.org/read/). Chapter 24 goes into detail about approaches to parallel programming, with specific examples using MapReduce: http://book.realworldhaskell.org/read/concurrent-and-multicore-programming.html
Jared Updike
+1  A: 

This answer describes how I would do it with the parallel language and runtime system that I work on day-to-day, Charm++. Note that the language used for sequential code within this framework is C or C++, so you would have to put some effort into porting the computational code. Charm++ does have some mechanisms for interoperating with Python code, though I'm less familiar with those aspects. It's possible that you could keep the driver and interface code in Python and put just the heavy computational code in C++. Regardless, the porting effort on the sequential code alone would likely gain you a good next increment of performance.

Design:

Create an array of parallel objects (called chares in our environment) and assign each one a worklist of interior tree nodes, starting at some subtree root and extending partway downward. Any leaves attached to those nodes would also belong to that chare.

Each parallel object would need two asynchronous, remotely invocable methods (known as entry methods), passDown(float a, float b) and passUp(int nodeID, float f), which would be the points of communication among them. passDown would call whatever node method does the pre-order computation, and nodes with out-of-object children would call passDown on the objects holding those descendants.

Once all of its downward work is done, the object would compute the upward work on its leaves and wait for its descendants. Each invocation of passUp would compute as far up the receiving object's tree as it can, until it hits a parent that hasn't yet received data from all of its children. When the object's root node finishes its upward work, the object would call passUp on the object holding the parent node. When the root of the whole tree is done, you know an iteration has completed.
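
To make the message flow concrete, here is a plain-Python schematic of that design. This is not Charm++ syntax (real entry-method calls would be asynchronous messages delivered by the runtime), the *Local helpers are hypothetical stand-ins for the within-object tree work, and it simplifies by starting upward work only after every descendant chare has reported:

    def preorderLocal(chare_id, a, b):    # placeholder: pre-order over local nodes
        return a, b

    def postorderLocal(chare_id):         # placeholder: post-order over local nodes
        return 0.0

    def recordChildResult(chare_id, node_id, f):  # placeholder bookkeeping
        pass

    class TreeChare:                      # one parallel object holding a tree slice
        def __init__(self, chare_id, parent=None):
            self.chare_id = chare_id
            self.parent = parent          # chare holding this slice's parent node
            self.child_chares = []        # chares holding out-of-object children
            self.pending = 0              # passUp calls still outstanding

        def passDown(self, a, b):         # "entry method": downward work
            a, b = preorderLocal(self.chare_id, a, b)
            self.pending = len(self.child_chares)
            for chare in self.child_chares:
                chare.passDown(a, b)      # each boundary child would really get
                                          # its own node-specific tuple
            if self.pending == 0:
                self._finishUpward()      # no descendants: upward work starts now

        def passUp(self, node_id, f):     # "entry method": a descendant finished
            recordChildResult(self.chare_id, node_id, f)
            self.pending -= 1
            if self.pending == 0:
                self._finishUpward()

        def _finishUpward(self):
            result = postorderLocal(self.chare_id)
            if self.parent is not None:
                self.parent.passUp(self.chare_id, result)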

Runtime results:

Once this is implemented, the runtime system handles the parallel execution for you. It will distribute the objects among processors, and even across separate compute nodes (hence dramatically raising your tree-size ceiling, since available memory scales with the number of nodes). Communication across processors and nodes looks just like in-process communication: asynchronous method calls. The runtime can load-balance the objects to keep all of your processors busy through as much of each iteration as possible.

Tuning:

If you go this way and get to the point of tuning parallel performance, you can also set priorities on the messages to keep the critical path short. Off the top of my head, the prioritization I'd suggest would do work in this order:

  1. Downward work that's non-zero
    • Closer to the root goes sooner
  2. Upward work
    • Closer to leaves goes sooner

Charm++ also works with a performance-analysis tool known as Projections, which can give you further insight into how your program is performing.

Novelocrat