views:

1721

answers:

2

In one of the answers to Broadcast like UDP with the Reliability of TCP, a user mentions the Spread messaging API. I've also run across one called ØMQ. I also have some familiarity with MPI.

So, my main question is: why would I choose one over the other? More specifically, why would I choose to use Spread or ØMQ when there are mature implementations of MPI to be had?

+2  A: 

I have not used any of these libraries, but I may be able to give some hints.

  1. MPI is a communication protocol while Spread and ØMQ are actual implementation.
  2. MPI comes from "parallel" programming while Spread comes from "distributed" programming.

So, it really depends on whether you are trying to build a parallel system or distributed system. They are related to each other, but the implied connotations/goals are different. Parallel programming deals with increasing computational power by using multiple computers simultaneously. Distributed programming deals with reliable (consistent, fault-tolerant and highly available) group of computers.

The concept of "reliability" is slightly different from that of TCP. TCP's reliability is "give this packet to the end program no matter what." The distributed programming's reliability is "even if some machines die, the system as a whole continues to work in consistent manner." To really guarantee that all participants got the message, one would need something like 2 phase commit or one of faster alternatives.

eed3si9n
+6  A: 

MPI was deisgned tightly-coupled compute clusters with fast, reliable networks. Spread and ØMQ are designed for large distributed systems. If you're designing a parallel scientific application, go with MPI, but if you are designing a persistent distributed system that needs to be resilient to faults and network instability, use one of the others.

MPI has very limited facilities for fault tolerance; the default error handling behavior in most implementations is a system-wide fail. Also, the semantics of MPI require that all messages sent eventually be consumed. This makes a lot of sense for simulations on a cluster, but not for a distributed application.

tgamblin