I want to easily perform collective communications independently on each machine of my cluster. Let's say I have 4 machines with 8 cores each; my MPI program would run 32 MPI tasks. What I would like is, for a given function:

  • on each host, only one task performs a computation while the other tasks do nothing during this computation. In my example, 4 MPI tasks do the computation and the 28 others wait.
  • once the computation is done, each MPI task on each host performs a collective communication ONLY with local tasks (tasks running on the same host).

Conceptually, I understand I must create one communicator per host. I searched around and found nothing that explicitly does that, and I am not really comfortable with MPI groups and communicators. Here are my two questions:

  • is MPI_Get_processor_name unique enough for such behaviour?
  • more generally, do you have a piece of code doing that?

Thanks

A: 

Typically, how tasks are distributed over nodes can be controlled through your MPI runtime environment, e.g. by environment variables. The default tends to be sequential allocation; that is, for your example with 32 tasks distributed over 4 eight-core machines you'd have:

  • machine 1: MPI ranks 0-7
  • machine 2: MPI ranks 8-15
  • machine 3: MPI ranks 16-23
  • machine 4: MPI ranks 24-31

And yes, MPI_Get_processor_name should get you the hostname so you can figure out where the boundaries between hosts are.
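
A minimal sketch that just prints where each rank landed, so you can see the host boundaries for yourself:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(name, &len);   /* host name this rank runs on */
        printf("rank %d runs on %s\n", rank, name);
        MPI_Finalize();
        return 0;
    }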

janneb
+1  A: 

I don't think (educated thought, not definitive) that you'll be able to do what you want entirely from within your MPI program.

The response of the system to a call to MPI_Get_processor_name is system-dependent; on your system it might return node00, node01, node02, node03 as appropriate, or it might return my_big_computer for whatever processor you are actually running on. The former is more likely, but it is not guaranteed.

One strategy would be to start 32 processes and, if you can determine what node each is running on, partition your communicator into 4 groups, one on each node. This way you can manage inter- and intra-communications yourself as you wish.
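
A rough sketch of that partitioning, assuming the node is identified via MPI_Get_processor_name; the helper name split_by_node is just for illustration:

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sketch: gather every rank's host name, take the lowest global rank
       that shares this host name as the colour, and split MPI_COMM_WORLD
       into one communicator per node. */
    MPI_Comm split_by_node(void)
    {
        int rank, size, len, colour;
        char name[MPI_MAX_PROCESSOR_NAME];
        char *all;
        MPI_Comm node_comm;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        memset(name, 0, sizeof name);
        MPI_Get_processor_name(name, &len);

        all = malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
        MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                      all,  MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);

        for (colour = 0; colour < size; colour++)
            if (strcmp(all + (size_t)colour * MPI_MAX_PROCESSOR_NAME, name) == 0)
                break;                      /* first rank with the same name */

        MPI_Comm_split(MPI_COMM_WORLD, colour, rank, &node_comm);
        free(all);
        return node_comm;
    }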

Another strategy would be to start 4 processes and pin them to different nodes. How you pin processes to nodes (or processors) will depend on your MPI runtime and any job management system you might have, such as Grid Engine. This will probably involve setting environment variables -- but you don't tell us anything about your run-time system so we can't guess what they might be. You could then have each of the 4 processes dynamically spawn a further 7 (or 8) processes and pin those to the same node as the initial process. To do this, read up on the topic of intercommunicators and your run-time system's documentation.
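
A sketch of the spawning side of this idea; the executable name "worker" is hypothetical, and whether the reserved "host" info key is honoured for placement depends on your MPI implementation and job manager:

    #include <mpi.h>

    /* Sketch only: each of the 4 initial processes spawns 7 children and
       asks, via the "host" info key, for them to be placed on its own
       node.  The result is an intercommunicator to the children. */
    int main(int argc, char **argv)
    {
        char host[MPI_MAX_PROCESSOR_NAME];
        int len, errcodes[7];
        MPI_Info info;
        MPI_Comm children;

        MPI_Init(&argc, &argv);
        MPI_Get_processor_name(host, &len);

        MPI_Info_create(&info);
        MPI_Info_set(info, "host", host);   /* placement hint, implementation-dependent */

        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 7, info,
                       0, MPI_COMM_SELF, &children, errcodes);

        /* ... talk to the children over the intercommunicator ... */

        MPI_Info_free(&info);
        MPI_Comm_disconnect(&children);
        MPI_Finalize();
        return 0;
    }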

A third strategy, now it's getting a little crazy, would be to start 4 separate MPI programs (8 processes each), one on each node of your cluster, and to join them as they execute. Read about MPI_Comm_connect and MPI_Open_port for details.
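
The connect/accept handshake looks roughly like the sketch below; how the port name travels from one program to the other is the part you'd have to adapt (the command-line exchange here is just the simplest possibility, and the "server"/"client" roles are my own labels):

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    /* Sketch only: run one copy as "server" and the other, in a separate
       MPI job, as "client <port_name>".  Passing the port name on the
       command line may need quoting; a file or MPI_Publish_name would
       also work. */
    int main(int argc, char **argv)
    {
        char port_name[MPI_MAX_PORT_NAME];
        MPI_Comm inter;                       /* joins the two programs */

        MPI_Init(&argc, &argv);

        if (argc > 1 && strcmp(argv[1], "server") == 0) {
            MPI_Open_port(MPI_INFO_NULL, port_name);
            printf("port name: %s\n", port_name);   /* hand this to the client */
            MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
        } else if (argc > 2 && strcmp(argv[1], "client") == 0) {
            strncpy(port_name, argv[2], MPI_MAX_PORT_NAME - 1);
            port_name[MPI_MAX_PORT_NAME - 1] = '\0';
            MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
        } else {
            MPI_Finalize();
            return 1;
        }

        /* ... the two programs can now talk over 'inter' ... */
        MPI_Comm_disconnect(&inter);
        MPI_Finalize();
        return 0;
    }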

Finally, for extra fun, you might consider hybridising your program, running one MPI process on each node, and have each of those processes execute an OpenMP shared-memory (sub-)program.
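
The hybrid shape, in its most minimal form (compile with OpenMP support, e.g. -fopenmp, and launch one rank per node):

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, provided;

        /* FUNNELED: only the master thread makes MPI calls */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* the node's cores are used by the OpenMP threads in this region */
        #pragma omp parallel
        {
            printf("MPI rank %d, OpenMP thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }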

High Performance Mark
All three of your methods are interesting, but I cannot use them. I am working on a really big app and cannot change things like that.
Jérôme
+2  A: 

The specification says that MPI_Get_processor_name returns "A unique specifier for the actual (as opposed to virtual) node", so I think you'd be OK with that. I guess you'd do a gather to assemble all the host names and then assign groups of processors to go off and make their communicators; or dup MPI_COMM_WORLD, turn the names into integer hashes, and use MPI_Comm_split to partition the set.
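
A sketch of the hash-and-split variant; the helper name and the hash itself are only illustrative, and since two distinct host names hashing to the same value would wrongly share a communicator, comparing the gathered names (as sketched in the answer above) is the more robust option:

    #include <mpi.h>

    MPI_Comm node_comm_from_hostname(void)
    {
        char name[MPI_MAX_PROCESSOR_NAME];
        int i, len, rank;
        unsigned int colour = 0;
        MPI_Comm node_comm;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(name, &len);

        for (i = 0; i < len; i++)             /* simple string hash */
            colour = 31u * colour + (unsigned char)name[i];
        colour &= 0x7fffffffu;                /* colour must be non-negative */

        MPI_Comm_split(MPI_COMM_WORLD, (int)colour, rank, &node_comm);
        return node_comm;
    }

With that in hand, the pattern from the question is just: the first rank of node_comm does the computation, then broadcasts over node_comm only (buf, n and compute_result are placeholders):

    MPI_Comm node_comm = node_comm_from_hostname();
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);
    if (local_rank == 0)
        compute_result(buf);                  /* hypothetical computation */
    MPI_Bcast(buf, n, MPI_DOUBLE, 0, node_comm);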

You could also take the approach janneb suggests and use implementation-specific options to mpirun to ensure that the MPI implementation assigns tasks that way; OpenMPI uses --byslot to generate this ordering; with mpich2 you can use -print-rank-map to see the mapping.

But is this really what you want to do? If the other processes are sitting idle while one processor is working, how is this better than everyone redundantly doing the calculation? (Or is this very memory- or I/O-intensive, and you're worried about contention?) If you're going to be doing a lot of this -- treating on-node parallelization very differently from off-node parallelization -- then you may want to think about hybrid programming models: running one MPI task per node and MPI_spawning subtasks or using OpenMP for on-node communications, both as suggested by HPM.

Jonathan Dursi
MPI_Comm_split seems to me to be the best solution. I am testing it, but I wonder how OpenMPI handles it. If the tasks in a communicator all belong to the same host, is OpenMPI smart enough to perform a Bcast with shared memory only? Is it possible to assign a policy to a communicator or a bcast?
Jérôme
About your last question: why is it better this way? I am working on a small part of a big HPC program. Hybrid approaches (MPI + OpenMP) have been tested, but the version I am working on is pure MPI. At this stage, 1 core = 1 MPI task. At some point in the program, the MPI tasks call a Lapack function; all cores perform the same function on the same data. The idea is to perform this function on each host, but with only 1 MPI task, using a parallel implementation of this function. I am only hoping that ( Lapack_fc / nb_core + Bcast_time ) < Lapack_fc
Jérôme
All MPIs on the market are smart enough to use shared memory for local-only communications, and in fact will typically implement even global collectives via local operations first, then global ones. Basically, anything obvious is already implemented (unless there's a non-obvious reason that makes it difficult). As to why hybrid can be better: if throughout your code you have this distinction between on- and off-node parallelism, your parallelism model may as well reflect this. On the other hand, if this is just a small "one-off" inside a larger code, then maybe flat MPI is best.
Jonathan Dursi