tags:

views:

71

answers:

2

I need to build a system where we have a set of machines produce unique files, lets call them PRODUCERS and a set of machines receive these files which we call CONSUMERS. Any machine from the PRODUCERS can send files to one or more of the CONSUMERS[ based on some hash mechanism]. I need to build a mechanism which ensures that the file delivery happens in a guaranteed mannner. i.e. Producers or consumers may crash/reboot and be able to continue from where they left off. Is there any fool proof scalable way to implement this, it seems to be quite a common need in any fault tolerant system? The number of producers and consumers are expected to increase/decrease on the fly.

A: 

What you describe is probably most easily implemented using some form of message passing. You might want to take a look at http://www.zeromq.org; I've worked with this library myself and can whole-heartedly recommend it.

On a side-note: if you don't need to use C++, picking up some Erlang might be interesting in your case.

Greg S
+1  A: 

What you're describing sounds a bit like the replication mechanism of the Google File System architecture. You'll be most interested in section 3.1 and 3.2 of the paper, together with the illustration in figure 2.

A summary (with simplifications) as it applies to your case:

  1. PRODUCER sends the data, waits for a reply.
  2. CONSUMER(s) reply, "I have received all the data."
  3. PRODUCER sends a "finish write" command, waits for a reply.
  4. CONSUMER(s) reply, "I have flushed the data to disk."
  5. Now (and only now) consider the data, "saved."

GFS as described in the paper implements a number of optimizations, including pipelining the write to consumers instead of splitting one machine's bandwidth across n machines simultaneously.

To further your safety guarantees through crashes, you can make the write operations idempotent using an Intent Log. This could either be at the producer's end only (you retry after a timeout, for instance), or at the consumers' end as well (on reboot, continue the operation).

Andres Jaan Tack