views: 211
answers: 4

Erlang fault tolerance (as I understand it) includes the use of supervisor processes to keep an eye on worker processes, so if a worker dies the supervisor can start up a new one.
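
For concreteness, the sort of thing I mean is a standard OTP supervisor, which (as I understand it) looks roughly like the sketch below; the module and worker names are just placeholders.

    %% Rough sketch of an OTP supervisor; my_worker is a placeholder module.
    -module(my_sup).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    init([]) ->
        %% one_for_one: restart only the crashed child, up to 5 times in 10 seconds
        {ok, {{one_for_one, 5, 10},
              [{my_worker, {my_worker, start_link, []},
                permanent, 5000, worker, [my_worker]}]}}.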

How does Erlang do this monitoring, especially in a distributed scenario? How can it be sure the process has really died? Does it do heartbeats? Is something built into the runtime environment? What if a network cable is unplugged - does it assume the other processes have died if it cannot communicate with them? etc.

I have been thinking about how to achieve the same fault tolerance claimed by Erlang on the JVM (in, say, Java or Scala), but I am not sure whether doing it as well as Erlang requires support built into the VM itself. I have not yet come across a description of how Erlang does it to use as a point of comparison.

A: 

It appears that someone has implemented a similar strategy in Scala. My expectation would be that a supervisor would treat a network failure as a failed subprocess, and the documentation for that Scala project seems to bear this out.

jsight
Thanks - it was an interesting post. I left a message there trying to work out whether it supports network connections. I got the feeling (possibly incorrect) that it was watching something else within the JVM and was not dealing with issues across process boundaries. But if it all works, that would be great!
Alan Kent
A: 

I think what you mean by a supervisor process is the portmapper. You could use the Erlang portmapper/infrastructure via JInterface and thus avoid reinventing the wheel; if you still want to build it yourself, you at least get all the interfaces described there.

weismat
Thanks, but I was hoping to have only the Java VM around (no Erlang VM). That keeps things simpler (politically).
Alan Kent
+4  A: 

Erlang/OTP supervision is typically not done between processes on different nodes. It would work, but best practice is to do it differently.

The common approach is to write the entire application so that it runs on each machine, but make the application aware that it is not alone. Some part of the application runs a node monitor so it is aware of node-downs (this is done with a simple network ping). These node-downs can be used to change load-balancing rules, fail over to another master, and so on.
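
Such a node monitor can be as small as the sketch below ('backend@host2' and handle_node_down are made-up names):

    %% Sketch: watch one peer node and react when it goes down.
    watch_peer() ->
        Node = 'backend@host2',          % made-up node name
        erlang:monitor_node(Node, true),
        receive
            {nodedown, Node} ->
                %% e.g. change load balancing or promote another master
                handle_node_down(Node)
        end.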

This ping means there is latency in detecting node-downs. It can take quite a few seconds to detect a dead peer node (or a dead link to it).

If the supervisor and the worker run locally, the crash and the resulting signal to the supervisor are pretty much instantaneous. This relies on the fact that an abnormal exit propagates to linked processes, which crash as well unless they trap exits.
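
Locally, that link-and-trap behaviour looks roughly like this sketch (the worker body is a placeholder):

    %% Sketch: a supervising process traps exits from a linked worker
    %% and restarts it when it dies.
    supervise() ->
        process_flag(trap_exit, true),
        Pid = spawn_link(fun worker/0),
        receive
            {'EXIT', Pid, Reason} ->
                io:format("worker ~p died: ~p, restarting~n", [Pid, Reason]),
                supervise()
        end.

    worker() ->
        %% placeholder worker body
        receive stop -> ok end.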

Christian
Thanks, that makes a lot of sense. It seems a common theme that sending messages between machines is different from sending between local processes (greater overheads, more ways it can fail, etc.), so code your application to know about this (there is no silver bullet that makes local and remote calls the same, so don't try). This means a similar model on the JVM is certainly possible: only supervise local processes/threads/fibres/actors/whatever, and code the pinging of other nodes (and what to do if you cannot reach one) into your application.
Alan Kent
A: 

Erlang is open source, which means you can download the source and get the definitive answer on how Erlang does it.

How does Erlang do this monitoring, especially in a distributed scenario? How can it be sure the process has really died? Does it do heartbeats? Is something built into the runtime environment?

I believe it's done in the BEAM runtime. When a process dies, a signal is sent to all processes linked to it. See Chapter 9 of Programming Erlang for a full discussion.
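
A small sketch of watching another process - using a monitor rather than a link, so the watcher gets an ordinary message instead of an exit signal:

    %% Sketch: watch another process with a monitor; when it dies the
    %% runtime delivers a 'DOWN' message rather than an exit signal.
    watch(Pid) ->
        Ref = erlang:monitor(process, Pid),
        receive
            {'DOWN', Ref, process, Pid, Reason} ->
                {process_died, Reason}
        end.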

What if a network cable is unplugged - does it assume the other processes have died if it cannot communicate with them? etc.

In Erlang, you can choose to monitor a node and receive {nodeup, Node} and {nodedown, Node} messages. I assume these will also be sent if you can no longer talk to a node. How you handle them is up to you.
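
A sketch of that subscription, using net_kernel:monitor_nodes/1:

    %% Sketch: subscribe to node up/down notifications from the runtime.
    node_watcher() ->
        ok = net_kernel:monitor_nodes(true),
        node_loop().

    node_loop() ->
        receive
            {nodeup, Node}   -> io:format("node up: ~p~n", [Node]),   node_loop();
            {nodedown, Node} -> io:format("node down: ~p~n", [Node]), node_loop()
        end.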

pgs