I'm using MPI (with Fortran, but the question is more specific to the MPI standard than to any given language), and specifically the non-blocking send/receive functions MPI_Isend and MPI_Irecv. Now if we imagine the following scenario:

Process 0:

isend(stuff1, ...)
isend(stuff2, ...)

Process 1:

wait 10 seconds
irecv(in1, ...)
irecv(in2, ...)

Are the messages delivered to Process 1 in the order they were sent, i.e. can I be sure that in1 == stuff1 and in2 == stuff2 if the tag used is the same in all cases?
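
For concreteness, here's a minimal Fortran sketch of the scenario I mean (the array size, tag value, and delay are placeholders):

program ordering
   use mpi
   implicit none
   integer, parameter :: n = 4, tag = 7
   integer :: rank, ierr
   integer :: reqs(2)
   integer :: stuff1(n), stuff2(n), in1(n), in2(n)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   if (rank == 0) then
      stuff1 = 1
      stuff2 = 2
      ! two non-blocking sends, same destination, same tag
      call MPI_Isend(stuff1, n, MPI_INTEGER, 1, tag, MPI_COMM_WORLD, reqs(1), ierr)
      call MPI_Isend(stuff2, n, MPI_INTEGER, 1, tag, MPI_COMM_WORLD, reqs(2), ierr)
      call MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE, ierr)
   else if (rank == 1) then
      ! (imagine a 10 second delay here before the receives are posted)
      call MPI_Irecv(in1, n, MPI_INTEGER, 0, tag, MPI_COMM_WORLD, reqs(1), ierr)
      call MPI_Irecv(in2, n, MPI_INTEGER, 0, tag, MPI_COMM_WORLD, reqs(2), ierr)
      call MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE, ierr)
      ! is in1 == stuff1 and in2 == stuff2 guaranteed at this point?
   end if

   call MPI_Finalize(ierr)
end program ordering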

+3  A: 

Yes, the messages are received in the order they are sent. The standard describes this property as messages being "non-overtaking". See this MPI Standard section for more details; here's an excerpt:

Order: Messages are non-overtaking: If a sender sends two messages in succession to the same destination, and both match the same receive, then this operation cannot receive the second message if the first one is still pending. If a receiver posts two receives in succession, and both match the same message, then the second receive operation cannot be satisfied by this message, if the first one is still pending. This requirement facilitates matching of sends to receives. It guarantees that message-passing code is deterministic, if processes are single-threaded and the wildcard MPI_ANY_SOURCE is not used in receives. (Some of the calls described later, such as MPI_CANCEL or MPI_WAITANY, are additional sources of nondeterminism.)
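
To illustrate the MPI_ANY_SOURCE caveat at the end of that excerpt, here's a rough sketch (my own names, needs at least three ranks): each sender's messages remain ordered relative to itself, but the interleaving between different senders is undefined.

program anysource
   use mpi
   implicit none
   integer :: rank, ierr, i, val
   integer :: status(MPI_STATUS_SIZE)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   if (rank == 1 .or. rank == 2) then
      val = rank
      call MPI_Send(val, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, ierr)
   else if (rank == 0) then
      do i = 1, 2
         ! with MPI_ANY_SOURCE, either sender's message may match first,
         ! so the sequence of values printed here is not deterministic
         call MPI_Recv(val, 1, MPI_INTEGER, MPI_ANY_SOURCE, 0, &
                       MPI_COMM_WORLD, status, ierr)
         print *, 'received', val, 'from rank', status(MPI_SOURCE)
      end do
   end if

   call MPI_Finalize(ierr)
end program anysource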

Edric
A: 

Yes and no.

can I be sure that in1 == stuff1 and in2 == stuff2 if the tag used is the same in all cases?

Yes. There is a deterministic 1:1 correspondence between sends and recvs that will get the correct input into the correct recv buffer. This behavior is guaranteed by the standard and is enforced by all MPI implementations.

No. The exact order of internal message progression, and the exact order in which buffers on the receiver side are populated, is somewhat of a black box, especially when RDMA-style message transfers with multiple in-flight buffers are being used (e.g. InfiniBand).

If your code is using multiple threads, and inspecting the buffer to determine completeness (e.g. waiting on a bit to be toggled) rather than using MPI_Test or MPI_Wait, then it is possible for the messages to arrive out of order (but in the correct buffers).
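
If a thread needs to poll rather than block, the supported pattern is to loop on MPI_Test against the request, not to peek at the buffer contents. A rough sketch (names and sizes are mine):

program poll_test
   use mpi
   implicit none
   integer :: rank, ierr, req
   integer :: status(MPI_STATUS_SIZE)
   logical :: done
   integer :: buf(4)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   if (rank == 0) then
      buf = 42
      call MPI_Send(buf, 4, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
   else if (rank == 1) then
      call MPI_Irecv(buf, 4, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, req, ierr)
      done = .false.
      do while (.not. done)
         ! MPI_Test (or MPI_Wait) is the only portable way to learn that
         ! the transfer is complete; inspecting buf itself is not
         call MPI_Test(req, done, status, ierr)
         ! ... do other useful work between polls ...
      end do
      print *, 'message complete, buf(1) =', buf(1)
   end if

   call MPI_Finalize(ierr)
end program poll_test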

If your code depends on in1 == stuff1 being populated BEFORE in2 == stuff2 is populated on the receiver side, and there is a single sending rank for both messages, then using MPI_Issend (non-blocking, synchronous send) will guarantee the messages are recv'd in order; see the sketch below. If you need to guarantee the buffer-population order of multiple recvs from multiple sending ranks, then some kind of blocking call is required between each recv (e.g. MPI_Recv, MPI_Barrier, MPI_Wait, etc.).
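
A rough sketch of that MPI_Issend pattern for the single-sender case (reusing the names from the question; the wait between the two synchronous sends does the serializing):

program ordered_issend
   use mpi
   implicit none
   integer, parameter :: n = 4, tag = 7
   integer :: rank, ierr, req
   integer :: stuff1(n), stuff2(n), in1(n), in2(n)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   if (rank == 0) then
      stuff1 = 1
      stuff2 = 2
      ! a synchronous send does not complete until the matching receive
      ! has started, so waiting here serializes the order in which the
      ! receiver's buffers begin to fill
      call MPI_Issend(stuff1, n, MPI_INTEGER, 1, tag, MPI_COMM_WORLD, req, ierr)
      call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)
      call MPI_Issend(stuff2, n, MPI_INTEGER, 1, tag, MPI_COMM_WORLD, req, ierr)
      call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)
   else if (rank == 1) then
      call MPI_Recv(in1, n, MPI_INTEGER, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
      call MPI_Recv(in2, n, MPI_INTEGER, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
   end if

   call MPI_Finalize(ierr)
end program ordered_issend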

semiuseless
Have to -1 as I can't quite believe the "No" part of your "Yes and no" answer... It's ludicrous to "check for completion" by the means you suggest there. Equivalently, if I asked you "Is `x = 42; printf("%d", x);` guaranteed to print 42?" you could just as well say "Yes and no; no since if you were single-stepping through a debugger and altered the value of `x` then it wouldn't."
j_random_hacker
@j... The situation described has happened with three different users I support. Their jobs had multi-threaded ranks, with a single MPI communication thread. The other threads were interacting with hardware controllers. A hardware controller thread was inspecting the final bit of the buffer to determine message completion. When they moved from TCP to IB, the out-of-order message arrival issue was revealed. This is more like a "compiler optimization" than the debugger example you gave: the fabric manager preserved the buffer ordering but optimized the actual transmission order.
semiuseless
Well I'm stunned. (That's what MPI_Test() is *for*!) But if 3 users got this wrong then the info in your post is valuable, so if you edit it to add a big loud "BUT DON'T DO THIS, IT'S STUPID" or somesuch, I'll +1.
j_random_hacker
I am less concerned with the specific example than with the broad idea: with RDMA, the exact order in which buffers on the receiver side are populated is somewhat of a black box. Using MPI_Test or MPI_Wait (or the all/any variations) is the "front door" method for determining when a message transfer is complete. I specifically mentioned the multi-threaded case in my example. Making MPI calls from multiple threads introduces a performance penalty. Using MPI_Test/Wait from one thread to check on the completeness of a message transfer in another thread may not be appropriate.
semiuseless
If you need to make MPI calls in different threads, then *you need to make MPI calls in different threads* and you must pay the performance penalty of using a threadsafe MPI implementation -- the alternative, as you've shown, is undefined behaviour. I don't understand why you don't want to make that point explicit, but OK.
j_random_hacker