Long version:
I'm new to erlang, and considering using it for a scalable architecture. I've found many proponents of the platform touting its reliability and fault tolerance.
However, I'm struggling to understand exactly how fault-tolerance is achieved in this system where messages are queued in transient memory. I understand that a supervisor hierarchy can be arranged to respawn deceased processes, but I've been unable to find much discussion of the implications of respawning on works-in-progress. What happens to in-flight messages and the artifacts of partially-completed work that were lost on a dying node?
Will all producers automatically retransmit messages that are not ack'd when consumer processes die? If not, how can this be considered fault-tolerant? And if so, what prevents a message that was processed -- but not quite acknowledged -- from being retransmitted, and hence reprocessed inappropriately?
(I recognize that these concerns are not unique to erlang; similar concerns will arise in any distributed processing system. But erlang enthusiasts seem to claim that the platform makes this all "easy"..?)
Assuming messages are retransmitted, I can easily envision a scenario where the downstream effects of a complex messaging chain could become very muddled after a fault. Without some sort of heavy distributed transaction system, I don't understand how consistency and correctness can be maintained without addressing duplication in every process. Must my application code always enforce constraints to prevent transactions from being executed more than once?
Short version:
Are distributed erlang processes subject to duplicated messages? If so, is duplicate-protection an application responsibility, or does erlang/OTP somehow help us with this?