views:

234

answers:

2

Long version:

I'm new to erlang, and considering using it for a scalable architecture. I've found many proponents of the platform touting its reliability and fault tolerance.

However, I'm struggling to understand exactly how fault-tolerance is achieved in this system where messages are queued in transient memory. I understand that a supervisor hierarchy can be arranged to respawn deceased processes, but I've been unable to find much discussion of the implications of respawning on works-in-progress. What happens to in-flight messages and the artifacts of partially-completed work that were lost on a dying node?

Will all producers automatically retransmit messages that are not ack'd when consumer processes die? If not, how can this be considered fault-tolerant? And if so, what prevents a message that was processed -- but not quite acknowledged -- from being retransmitted, and hence reprocessed inappropriately?

(I recognize that these concerns are not unique to erlang; similar concerns will arise in any distributed processing system. But erlang enthusiasts seem to claim that the platform makes this all "easy"..?)

Assuming messages are retransmitted, I can easily envision a scenario where the downstream effects of a complex messaging chain could become very muddled after a fault. Without some sort of heavy distributed transaction system, I don't understand how consistency and correctness can be maintained without addressing duplication in every process. Must my application code always enforce constraints to prevent transactions from being executed more than once?

Short version:

Are distributed erlang processes subject to duplicated messages? If so, is duplicate-protection an application responsibility, or does erlang/OTP somehow help us with this?

+2  A: 

The erlang OTP system is fault tolerant. That doesn't relieve you of the need to build equally fault tolerant apps in it. If you use erlang and OTP then there are a few things you can rely on.

  1. When a process dies that process will be restarted.
  2. For the most part a process crashing won't bring down your whole app
  3. When a message is sent it will be received provided the receiver exists.

As far as I know messages in erlang are not subject to duplication. If you send a message and the process receives it then the message is gone from the queue. However if you send a message and the process receives that message but crashes while processing it then that message is gone and unhandled. That fact should be considered in the design of your system. OTP helps you handle all of this by using processes to isolate infrastructure critical code (eg. supervisors, gen_servers, ...) from application code that might be subject to crashes.

For instance you might have a gen_server that dispatches work to a process pool. The processes in the pool might crash and get restarted. But the gen_server remains up since its entire purpose is just to recieve messages and dispatch them to the pool to work on. This allows the whole system to stay up despite errors and crashes in the pool and there is always something waiting for your message.

Just because the system is fault tolerant doesn't mean your algorithm is.

Jeremy Wall
+21  A: 
I GIVE TERRIBLE ADVICE
We need a way to donate some of our points to pump answers up by more than +1. Seriously good work here.
JUST MY correct OPINION