My application runs in an Erlang cluster, usually with two or more nodes. The nodes actively monitor each other (using erlang:monitor_node), which works fine: I can detect and react to the fact that a node that was up is now down.

But how do I then find out that the node has restarted and is back in business? I can of course ping the node periodically until it is back up, but is there a better way that I've simply missed? Are process groups a better way of achieving this?

(Edited to add)

I think the suggestion to use a technique like electing a supervisor is the line of thought I was missing. I'll look into that and mark this question as done.

+1  A: 

Just an idea, but how about having the restarting node itself explicitly inform the supervisor/monitoring node that it has finished restarting and that it is available again?

You could use a recurring "heartbeat message" for this purpose, or come up with a custom message specifically meant to be sent once after successful initialization. Something along the lines of:

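%% Called once the restarted node has finished initializing.
start(SupervisorPid) ->
    %% Announce ourselves; self() is the pid of the calling process.
    SupervisorPid ! {hello, self()},
    %% mainloop/0 stands for the node's normal work and is not shown here.
    mainloop().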
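
For the receiving side, here is a minimal sketch of what the monitoring process might do with that message. The module name buddy_watch and the {peers, From} query are made up for illustration; erlang:monitor/2 and the 'DOWN' message it produces are standard. The idea is that the watcher starts monitoring whoever says hello, so it also learns when that peer disappears again.

```erlang
-module(buddy_watch).
-export([start/0, loop/1]).

%% Spawn the watcher; restarted peers send {hello, Pid} to the returned pid.
start() ->
    spawn(?MODULE, loop, [#{}]).

%% Peers maps monitor references to the pids that have announced themselves.
loop(Peers) ->
    receive
        {hello, Pid} when is_pid(Pid) ->
            %% Monitor the announcing process so its loss is noticed as well.
            Ref = erlang:monitor(process, Pid),
            io:format("~p on ~p is back~n", [Pid, node(Pid)]),
            loop(Peers#{Ref => Pid});
        {'DOWN', Ref, process, Pid, Reason} ->
            io:format("~p went down: ~p~n", [Pid, Reason]),
            loop(maps:remove(Ref, Peers));
        {peers, From} ->
            %% Hypothetical query: reply with the currently known peers.
            From ! {peers, maps:values(Peers)},
            loop(Peers)
    end.
```

With this in place, start(SupervisorPid) above would be called with the pid returned by buddy_watch:start() (or a registered name standing in for it).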
none
Yes, this is actually what the node does when it restarts. There isn't any supervisor node per se; the nodes are effectively "buddies", and a node reaches out to its buddies on startup to determine the state of the system (and perhaps copy that state).
Alan Moore
And to be clear, each node is equal, so what do you do if you're the first node up? You can't rpc to any node, and you don't (of course) have any PIDs to send messages to. BUT it could simply hang around and wait for other nodes to start up and contact it...
Alan Moore
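
For the "hang around and wait" case, the standard library can deliver node arrivals as messages, so no polling is needed. A minimal sketch, assuming the nodes end up connected to each other (for instance because the restarting node contacts its buddies, as described above); the module name node_watch is made up, while net_kernel:monitor_nodes/1 and the {nodeup, Node} / {nodedown, Node} messages are part of the kernel application.

```erlang
-module(node_watch).
-export([start/0, init/0]).

start() ->
    spawn(?MODULE, init, []).

init() ->
    %% Subscribe this process to {nodeup, Node} and {nodedown, Node} messages.
    ok = net_kernel:monitor_nodes(true),
    loop().

loop() ->
    receive
        {nodeup, Node} ->
            %% A connection to Node was established, e.g. after its restart.
            io:format("~p joined~n", [Node]),
            loop();
        {nodedown, Node} ->
            io:format("~p left~n", [Node]),
            loop()
    end.
```

Note that nodeup only fires once a connection to the node exists, so something still has to initiate that connection, typically the restarted node itself.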
It seems your nodes are basically decentralized, without any form of central node? I assume that normally you'll want at least one supervisor, or at least some form of 'master' node that all nodes can send their reports/messages to, i.e. some form of "node registry". Maybe you need to provide some more info. One could also imagine having, by convention, each node become the master if there is no master already. That would satisfy the equality requirement.
none
I would recommend having some sort of master node. That way you decrease the amount of messaging required. If the master node goes down you could always hold an election of some sort. There is a reason OTP systems use a supervisor hierarchy for node management. Even the process pool module uses a master node to manage the pool.
Jeremy Wall
Within each node I do have a whole supervisor hierarchy, and that works well. I've tried to steer clear of making one particular node the "master", as I'm trying to make the system quite fault tolerant, and any client of this application can connect to any of the available nodes to get an equivalent service. The election idea is a good one though: if all nodes know who the current master is and that node dies, they can decide amongst themselves who the new master is...
Alan Moore
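
One very simple way for equal nodes to "decide amongst themselves" is a fixed convention rather than a real election protocol: every node applies the same rule to the list of nodes it can currently see. The sketch below picks the smallest node name; the module name master_pick and the rule itself are illustrative assumptions, and the approach only gives a consistent answer while all nodes see the same membership (it does not handle network partitions).

```erlang
-module(master_pick).
-export([master/0, master/1, am_i_master/0]).

%% Master according to this node's current view of the cluster.
master() ->
    master([node() | nodes()]).

%% Deterministic rule: the smallest node name wins.
master(Nodes) ->
    lists:min(Nodes).

%% True if the local node should act as master right now.
am_i_master() ->
    master() =:= node().
```

When the current master goes down, each surviving node can simply call master_pick:master() again; the next-smallest name takes over without any extra messages.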