ansaurus

Question

How to monitor a remote Erlang node which was down and is restarting

Answer 1

But how do I then find out that the node has restarted and is back in business? I can of course periodically ping the node until it is back up, but is there a better way that I've simply missed? Are process groups a better way of achieving this?

Just an idea, but how about having the restarting node itself explicitly inform the supervisor/monitoring node that it has finished restarting and that it is available again?

You could use a recurring "heartbeat message" for this purpose, or come up with a custom message specifically meant to be sent once after successful initialization. Something along the lines of:

start(SupervisorPID) ->
  %% Announce this process to the monitoring process, then enter the main loop.
  SupervisorPID ! {hello, self()},
  mainloop().
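
For the receiving side, here is a minimal sketch of a process that collects these announcements. The module name buddy_watch and the {peers, From} query message are made up for illustration. It assumes each restarting node sends {hello, Pid} as above, and it monitors the announcing process so that the entry is dropped again if that process or its node goes away.

```erlang
-module(buddy_watch).
-export([start/0]).

%% Hypothetical watcher: remembers every process that announced itself
%% with {hello, Pid} and forgets it again when it goes down.
start() ->
    spawn(fun() -> loop(#{}) end).

loop(Peers) ->
    receive
        {hello, Pid} when is_pid(Pid) ->
            %% A monitor delivers a 'DOWN' message if the process exits
            %% or the connection to its node is lost.
            Ref = erlang:monitor(process, Pid),
            loop(Peers#{Ref => Pid});
        {'DOWN', Ref, process, _Pid, _Reason} ->
            loop(maps:remove(Ref, Peers));
        {peers, From} ->
            %% Illustrative query: reply with the currently known peers.
            From ! {peers, maps:values(Peers)},
            loop(Peers)
    end.
```

A restarted node would then call something like `WatcherPid ! {hello, self()}` once its own initialization has finished.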

none 2009-06-12 00:04:44

Yes - this is actually what the node does when it restarts - there isn't actually any supervisor node per se, they are effectively "buddies" and the node reaches out to its buddies to determine the state of the system (and perhaps copying that state) when it starts up.

Alan Moore 2009-06-12 01:09:11
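
If the restarting node already reaches out to its buddies, the buddies do not need to ping: a process can subscribe to node status messages with net_kernel:monitor_nodes/1 and will receive {nodeup, Node} once a connection to the restarted node is established. A minimal sketch follows; the module name node_watch is made up, and the printing is only a placeholder for whatever the application wants to do.

```erlang
-module(node_watch).
-export([start/0]).

%% Hypothetical watcher process that reacts to nodes connecting
%% and disconnecting instead of polling them.
start() ->
    spawn(fun() ->
        %% Subscribe the calling process to nodeup/nodedown messages.
        net_kernel:monitor_nodes(true),
        loop()
    end).

loop() ->
    receive
        {nodeup, Node} ->
            io:format("~p is connected again~n", [Node]),
            loop();
        {nodedown, Node} ->
            io:format("~p went down~n", [Node]),
            loop()
    end.
```

Note that nodeup is only delivered when a connection is actually made, so somebody still has to initiate contact after the restart, for example the restarted node itself when it looks for its buddies.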

And to be clear - each node is equal so what do you do if you're the first node up - you can't rpc to any node, and you don't have (of course) any PIDs to send messages to. BUT it could simply hang around and wait for any other nodes to start up and contact it...

Alan Moore 2009-06-12 01:12:51

It seems your nodes are basically decentralized, without any form of central node? I assume that normally you'll want to have at least one supervisor, or at least some form of 'master' node that all nodes can send their reports/messages to, i.e. some form of "node registry". Maybe you need to provide some more info. One could also imagine having, by convention, each node become a master if there is no master already. That would satisfy the equality requirement.

none 2009-06-12 01:30:15

I would recommend having some sort of master node. That way you decrease the amount of messaging required. If the master node goes down you could always hold an election of some sort. There is a reason OTP systems use a supervisor hierarchy for node management. Even the process pool module uses a master node to manage the pool.

Jeremy Wall 2009-06-12 01:55:13

Within each node I do have a whole supervisor hierarchy - and that works well. I've tried to steer clear of having one particular node being the "master" as I'm trying to make it quite fault tolerant and any client of this application can connect to any of the available nodes to get an equivalent service. The election idea is a good one though - if all nodes know who's the current master and that node dies they can decide amongst themselves who's the new master...

Alan Moore 2009-06-12 05:32:42
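
As a rough illustration of the election idea, one very simple convention is to treat the connected node with the smallest name as the current master. This is only a sketch, not a robust election protocol: the module name master_pick is made up, and it assumes all nodes see the same set of connected nodes, which does not hold during a network partition.

```erlang
-module(master_pick).
-export([master/0, is_master/0]).

%% Pick the master deterministically: the smallest node name among
%% this node and all nodes it is currently connected to.
master() ->
    lists:min([node() | nodes()]).

%% True if this node is the one the convention selects.
is_master() ->
    master() =:= node().
```

Because every node evaluates the same rule, no messages are needed to agree on the result; when the master dies it drops out of nodes() and the next smallest name takes over.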
