I want to parallelize a program. That's not too difficult with threads working on one big data structure in shared memory. But I want to be able to distribute it over a cluster, and I have to choose a technology for that. MPI is one idea.

The question is: what overhead will MPI (or another technology) have if I skip implementing a specialized version for shared memory and let MPI handle all cases?

Update:

I want to grow a large data structure (a game tree) simultaneously on many computers. Most of it will live on only one cluster node, but some of it (the irregular top of the tree) will be shared and synchronized from time to time.

On a shared-memory machine I would like this to be achieved through shared memory. Can this be done generically?
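
To illustrate, a rough sketch of what I have in mind for the cluster case (the flat top_counts array and the constants are just placeholders, not my actual code): each process grows its own subtree independently, and the counts for the shared top of the tree are merged periodically.

    /* Sketch only: each rank grows its local subtree and periodically
     * merges visit counts for the shared top-of-tree nodes. The flat
     * top_counts array and the constants are hypothetical placeholders. */
    #include <mpi.h>
    #include <stdlib.h>

    #define TOP_NODES     1024   /* shared, synchronized part of the tree */
    #define SYNC_INTERVAL 1000   /* iterations between synchronizations   */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        long *top_counts = calloc(TOP_NODES, sizeof *top_counts);

        for (long iter = 0; iter < 100000; ++iter) {
            /* ... grow the local part of the tree, update top_counts ... */

            if (iter % SYNC_INTERVAL == 0) {
                /* sum every rank's counts for the shared top of the tree */
                MPI_Allreduce(MPI_IN_PLACE, top_counts, TOP_NODES,
                              MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
            }
        }

        free(top_counts);
        MPI_Finalize();
        return 0;
    }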

A: 

It depends on the algorithm. Clearly, inter-node communication is orders of magnitude slower than shared memory, whether the latter is inter-process communication or multiple threads within a process. Therefore you want to minimize inter-node traffic, e.g. by duplicating data where possible and practicable, or by breaking the problem down in a way that minimizes communication between nodes.

For 'embarrassingly parallel' algorithms with little inter-node communication it's an easy choice. These are problems like brute-force searching for an encryption key, where each node can crunch numbers for long periods and report back to a central node periodically, but no communication is required to test the keys.
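
For example, a minimal MPI sketch of that pattern (check_key() here is just a stand-in for the real per-key test): the key space is split by rank, each rank searches its slice without talking to anyone, and only the final result is reduced to a central rank.

    /* Embarrassingly parallel sketch: split the key space by rank,
     * search each slice with no communication, and reduce only the
     * final result to the central rank 0. */
    #include <mpi.h>
    #include <stdio.h>

    /* hypothetical test; a real version would try the key against the cipher */
    static int check_key(long key) { return key == 123456789L; }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long keyspace = 1L << 30;
        long chunk = keyspace / size;
        long begin = (long)rank * chunk;
        long end   = (rank == size - 1) ? keyspace : begin + chunk;

        long found = -1;               /* -1 means "not found in my slice" */
        for (long k = begin; k < end && found < 0; ++k)
            if (check_key(k))
                found = k;

        /* the only communication: rank 0 learns the hit (or -1 if none) */
        long winner;
        MPI_Reduce(&found, &winner, 1, MPI_LONG, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("key found: %ld\n", winner);

        MPI_Finalize();
        return 0;
    }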

locster
A: 

All the popular MPI implementations will communicate locally via shared memory. The performance is very good as long as you don't spend all your time packing and unpacking buffers (i.e. your design is reasonable). In fact, the design imposed upon you by MPI can perform better than most threaded implementations because the separate address space improves cache coherence. To consistently beat MPI, the threaded implementations have to be aware of the cache hierarchy and what the other cores are working on.
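
As a side note on the packing point, MPI derived datatypes are one way to avoid hand-packing buffers: you can describe a non-contiguous layout once and send it directly. A rough sketch (my own example, using a strided matrix column):

    /* Sketch: describe a strided column of a matrix once with
     * MPI_Type_vector and send it directly, instead of copying it
     * into a contiguous scratch buffer by hand on every exchange. */
    #include <mpi.h>

    #define ROWS 256
    #define COLS 256

    int main(int argc, char **argv)
    {
        int rank;
        double a[ROWS][COLS];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* one column of a: ROWS blocks of 1 double, COLS doubles apart */
        MPI_Datatype column;
        MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        if (rank == 0) {
            for (int i = 0; i < ROWS; ++i)
                for (int j = 0; j < COLS; ++j)
                    a[i][j] = i + j;
            MPI_Send(&a[0][0], 1, column, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&a[0][0], 1, column, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        MPI_Type_free(&column);
        MPI_Finalize();
        return 0;
    }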

With good network hardware (like InfiniBand), the HCA is responsible for getting your buffers onto and off the network, so the CPU can do other things. Also, since many jobs are memory-bandwidth limited, they will perform better using, e.g., one core per socket across multiple nodes than using multiple cores per socket.

Jed