I have implemented an iterative algorithm in which each iteration involves a pre-order tree traversal (sometimes called a downwards accumulation) followed by a post-order tree traversal (an upwards accumulation). Each visit to a node calculates and stores information to be used by the next visit (either in the subsequent post-order traversal or in the next iteration).
During the pre-order traversal, each node can be processed independently as long as all nodes between it and the root have already been processed. After processing, each node passes a tuple (specifically, two floats) to each of its children. During the post-order traversal, each node can be processed independently as long as all of its subtrees (if any) have already been processed. After processing, each node passes a single float to its parent.
The structure of the trees is static and does not change during the algorithm. However, during the downward traversal, if both of the floats being passed down become zero, the entire subtree under that node does not need to be processed, and the upward traversal for that node can begin immediately. (The subtree must be preserved, because the floats passed down on subsequent iterations may become non-zero at this node again, at which point traversal of the subtree resumes.)
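To make the pattern concrete, here is a rough sketch in Haskell-like notation. My real implementation looks nothing like this; the Tree type, the single Double of per-node state, and everything marked "placeholder" are made up purely to show the shape of one iteration and the pruning rule:

    data Tree = Node Double [Tree]   -- per-node state plus a list of children

    -- One iteration: the downward pass threads a pair of floats from parent to
    -- child; the upward pass returns a single float to the parent.
    iterateOnce :: (Double, Double) -> Tree -> (Tree, Double)
    iterateOnce (a, b) t@(Node x cs)
      | a == 0 && b == 0 = (t, x)                     -- prune: skip the whole subtree this iteration
      | otherwise =
          let x'      = a * x + b                     -- placeholder: new stored state
              pair    = (a * x, b + x)                -- placeholder: pair handed to every child
              results = map (iterateOnce pair) cs
          in  (Node x' (map fst results), x' + sum (map snd results))   -- placeholder upward combine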
The intensity of the computation is the same at every node, and it is trivial: just a few sums and multiplications/divisions over a list of numbers whose length equals the number of children at the node.
The trees being processed are unbalanced: a typical node has 2 leaves plus 0-6 additional child nodes, so simply partitioning the tree into a set of relatively balanced subtrees is non-obvious (to me). Further, the trees are designed to consume all available RAM: the bigger the tree I can process, the better.
My serial implementation attains on the order of 1,000 iterations per second on my small test trees; with the "real" trees, I expect it to slow by an order of magnitude (or more?). Given that the algorithm requires at least 100 million iterations (possibly up to a billion) to reach an acceptable result, I'd like to parallelize it to take advantage of multiple cores. I have zero experience with parallel programming.
What is the recommended pattern for parallelization given the nature of my algorithm?
Try to rewrite your algorithm so that it is composed of pure functions. That means every piece of code is essentially a (small) static function with no dependence on global or static variables, all data is treated as immutable (changes are made only to copies), and every function "manipulates" state (in a loose sense of the word) only by returning new data.
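As a (hypothetical) illustration of what "changes are made only to copies" looks like for your data, here is a pure update of a tree node, using the same illustrative Tree type as the sketch in your question:

    data Tree = Node Double [Tree]   -- same illustrative type as the sketch in the question

    -- Pure update: instead of overwriting a node's stored value in place, build
    -- and return a new node. Anything still holding the old tree sees the old
    -- value, so repeating or discarding this call cannot corrupt shared state.
    withState :: (Double -> Double) -> Tree -> Tree
    withState f (Node x cs) = Node (f x) cs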
If every function is referentially transparent, i.e. it depends only on its input (and no hidden state) to compute its output, and every call with the same input always yields the same output, then you are in a good position to parallelize the algorithm: since your code never mutates global variables (or files, servers, etc.), the work a function does can be safely repeated (to recompute its result) or skipped entirely (no future code depends on its side effects, so dropping a call won't break anything). Then, when you run your suite of functions (for example on some implementation of MapReduce, Hadoop, etc.), the chain of dependencies falls out of nothing but which function's output feeds which function's input, and WHAT you are trying to compute (the pure functions) is completely separate from the ORDER in which it is computed (a question answered by the scheduler of a framework like MapReduce).
A great place to learn this mode of thinking is to write your algorithm in the programming language Haskell (or something like F# or OCaml), which has great support for parallel/multicore programming out of the box. Haskell forces your code to be pure, so if your algorithm works, it IS probably easily parallelizable.
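To make "out of the box" concrete, here is a hedged sketch of how one iteration could be parallelized in Haskell with Control.Parallel.Strategies (from the parallel package). The Tree type and the arithmetic are the same placeholders as in the sketch in your question, not your real computation; the depth cutoff is a common trick to keep the parallel tasks (sparks) coarse-grained, which matters here because the per-node work is cheap and the tree is unbalanced:

    import Control.DeepSeq (NFData (..))
    import Control.Parallel.Strategies (parList, rdeepseq, using)

    -- Placeholder tree, as in the question's sketch.
    data Tree = Node Double [Tree]

    -- Needed so rdeepseq can force an entire subtree's result.
    instance NFData Tree where
      rnf (Node x cs) = rnf x `seq` rnf cs

    -- Sequential version of one iteration (placeholder arithmetic).
    iterOnce :: (Double, Double) -> Tree -> (Tree, Double)
    iterOnce (a, b) t@(Node x cs)
      | a == 0 && b == 0 = (t, x)                  -- prune: both inputs are zero
      | otherwise =
          let x'      = a * x + b
              results = map (iterOnce (a * x, b + x)) cs
          in  (Node x' (map fst results), x' + sum (map snd results))

    -- Parallel version: evaluate the recursive calls on the children in
    -- parallel, but only for the top few levels; below the cutoff, fall back
    -- to the sequential code so each spark carries a whole subtree of work.
    parIterOnce :: Int -> (Double, Double) -> Tree -> (Tree, Double)
    parIterOnce cutoff (a, b) t@(Node x cs)
      | cutoff <= 0      = iterOnce (a, b) t
      | a == 0 && b == 0 = (t, x)
      | otherwise =
          let x'      = a * x + b
              results = map (parIterOnce (cutoff - 1) (a * x, b + x)) cs
                          `using` parList rdeepseq
          in  (Node x' (map fst results), x' + sum (map snd results))

Compile with ghc -O2 -threaded and run with +RTS -N so GHC's runtime can spread the sparks across your cores; nothing in the pure traversal code itself has to change.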