I have a large scientific computing task that parallelizes very well with SMP, but at too fine-grained a level to be easily parallelized via explicit message passing. I'd like to parallelize it across address spaces and physical machines. Is it feasible to create a scheduler that would parallelize already-multithreaded code across multiple physical computers, under the following conditions (a sketch of the kind of code I mean follows the list):
- The code is already multithreaded and can scale pretty well on SMP configurations.
- The fact that not all of the threads are running in the same address space or on the same physical machine must be transparent to the program, even if this comes at a significant performance penalty in some use cases.
- You may assume that all of the physical machines involved use binary-compatible CPU architectures and run binary-compatible operating systems.
- Primitives like locks and atomic operations may be slow (they now have network latency to contend with) but must "just work".
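
To make the granularity concrete, here is a minimal, purely illustrative D sketch of the kind of loop I have in mind (the array size and the checksum are placeholders): each iteration is a tiny unit of work, and every thread reads and writes shared memory directly, which is why recasting this as explicit message passing is unattractive.

```d
import std.parallelism : parallel;
import core.atomic : atomicOp;

void main()
{
    auto data = new int[](10_000_000);
    shared long checksum = 0;

    // Each iteration is a tiny unit of work; all threads touch shared
    // state directly instead of exchanging messages.
    foreach (i, ref x; parallel(data))
    {
        x = cast(int) (i % 101);        // trivial per-element computation
        atomicOp!"+="(checksum, x);     // shared state updated at fine grain
    }
}
```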
Edits:
- I only care about throughput, not latency.
- I'm using the D programming language, and I'm almost sure there's no canned solution. I'm more interested in whether this is feasible in principle than in any particular implementation.
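
To make the "locks must just work" requirement concrete (this is not a proposed design, and `LockCoordinator`, `DistributedLock`, and their methods are invented names): I imagine something like a drop-in lock whose acquire/release are forwarded over the network to some cluster-wide coordinator, so existing locking code keeps its local semantics, only slower.

```d
// Hypothetical sketch only: a lock whose acquire/release round-trip to a
// cluster-wide coordinator. Nothing here exists as a canned library.
interface LockCoordinator
{
    void acquire(ulong lockId);   // blocks until the cluster-wide lock is granted
    void release(ulong lockId);
}

final class DistributedLock
{
    private LockCoordinator coord;   // proxy that talks to a coordinator node
    private ulong id;

    this(LockCoordinator coord, ulong id)
    {
        this.coord = coord;
        this.id = id;
    }

    // Same contract as a local mutex's lock/unlock, just with network
    // latency attached to each call.
    void lock()   { coord.acquire(id); }
    void unlock() { coord.release(id); }
}
```

Atomic operations would presumably need similar treatment, e.g. being forwarded to whichever node currently owns the memory in question.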