I have implemented an iterative algorithm in which each iteration performs a pre-order tree traversal (sometimes called a downwards accumulation) followed by a post-order tree traversal (an upwards accumulation). Each visit to a node calculates and stores information to be used by that node's next visit (either in the subsequent post-order traversal or in the next iteration).
During the pre-order traversal, each node can be processed independently as long as all nodes between it and the root have already been processed. After processing, each node passes a tuple (specifically, two floats) to each of its children. During the post-order traversal, each node can be processed independently as long as all of its subtrees (if any) have already been processed. After processing, each node passes a single float to its parent.
The structure of the trees is static and does not change during the algorithm. However, during the downward traversal, if both floats passed to a node are zero, the entire subtree under that node does not need to be processed, and the upward traversal can begin at that node immediately. (The subtree must be preserved, because the floats passed on subsequent iterations may become non-zero at this node, at which point traversal of the subtree resumes.)
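To make the structure concrete, here is a stripped-down sketch of one iteration of my serial implementation. The struct, field, and function names are invented for this post, and the real per-node arithmetic is replaced by placeholder assignments:

```cpp
#include <vector>

// Hypothetical node layout -- these names are made up for this post,
// not taken from my real code.
struct Node {
    std::vector<Node*> children;
    double up_val = 0.0;   // the single float this node passes to its parent
};

// One downward (pre-order) visit followed, on the way back, by the upward
// (post-order) visit. Returns the float passed up to the parent.
double traverse(Node& n, double a, double b) {
    // Pruning shortcut: if both incoming floats are zero, the whole subtree
    // below this node is skipped and the upward pass starts here immediately.
    if (a == 0.0 && b == 0.0) {
        n.up_val = 0.0;                 // placeholder for the cheap local result
        return n.up_val;
    }

    // Pre-order work: a few sums and multiplies over the children list,
    // producing the two floats handed to each child (placeholders here).
    for (Node* c : n.children) {
        double child_a = a;             // placeholder for the real formula
        double child_b = b;             // placeholder for the real formula
        traverse(*c, child_a, child_b);
    }

    // Post-order work: combine the children's up_val into this node's up_val.
    double acc = 0.0;
    for (Node* c : n.children) acc += c->up_val;
    n.up_val = acc;                     // placeholder for the real formula
    return n.up_val;
}
```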
The intensity of computation is the same at every node, and the computation itself is trivial: just a few sums and multiplications/divisions on a list of numbers whose length equals the number of children at the node.
The trees being processed are unbalanced: a typical node has 2 leaves plus 0-6 additional child nodes, so it is not obvious (to me) how to partition a tree into a set of relatively balanced subtrees. Furthermore, the trees are designed to consume all available RAM: the bigger the tree I can process, the better.
My serial implementation attains on the order of 1,000 iterations per second on my small test trees; with the "real" trees, I expect it could slow by an order of magnitude or more. Given that the algorithm requires at least 100 million iterations (possibly up to a billion) to reach an acceptable result, I would like to parallelize it to take advantage of multiple cores. I have zero experience with parallel programming.
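For what it is worth, the only shape I can picture so far is something like the OpenMP-task sketch below (entirely untested, reusing the Node struct and placeholder formulas from the serial sketch above). I do not know whether spawning a task per child subtree is actually a sensible pattern, which is essentially what I am asking:

```cpp
#include <omp.h>
// (Node and the placeholder formulas as in the serial sketch above.)

// Untested guess at a task-parallel shape: spawn one task per child subtree
// during the downward pass, join them, then do this node's upward step.
double traverse_parallel(Node& n, double a, double b) {
    if (a == 0.0 && b == 0.0) {         // same pruning shortcut as the serial version
        n.up_val = 0.0;
        return n.up_val;
    }
    for (Node* c : n.children) {
        double child_a = a;             // placeholder for the real per-child tuple
        double child_b = b;
        #pragma omp task firstprivate(c, child_a, child_b)
        traverse_parallel(*c, child_a, child_b);
    }
    #pragma omp taskwait                // all child subtrees finish before the upward step
    double acc = 0.0;
    for (Node* c : n.children) acc += c->up_val;
    n.up_val = acc;
    return n.up_val;
}

// Invoked once per iteration from a parallel region, e.g.:
//   #pragma omp parallel
//   #pragma omp single
//   traverse_parallel(root, a0, b0);
```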
What is the recommended pattern for parallelization given the nature of my algorithm?