There is a cluster, and a Unix network daemon is started on each cluster node, but only one instance can be active at a time.

When the active daemon fails (whether the program crashes or the node goes down), the daemon on another node should become active.

I can think of a few possible algorithms, but I assume there is existing research on this and some ready-to-use algorithms. Am I right? Can you point me to them?

Thanks.

A: 

JGroups is a Java network stack that includes DistributedLockManager-style lock support and cluster voting capabilities. These allow any number of Unix daemons to agree on which one should be active. For example, all of the nodes could try to obtain the same lock; only one will succeed, and it keeps the lock until its application or node fails.

JGroups also has the concept of a coordinator for a specific communication channel. Only one node can be coordinator at a time, and when that node fails, another node becomes coordinator. It is simple to test whether you are the coordinator, in which case you would be the active node.

See: http://www.jgroups.org/javadoc/org/jgroups/blocks/DistributedLockManager.html
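To make the lock idea concrete, here is a minimal single-process sketch in Python. It is not the JGroups API: `LeaseLock` and `active_node` are hypothetical names, the lock lives in memory rather than on the network, and the lease timeout is my assumption about how a failed holder would eventually lose the lock.

```python
class LeaseLock:
    """Toy in-memory stand-in for a distributed lock service (hypothetical)."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds  # must be > 0
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node_id, now):
        """Return True if node_id holds the lock after this call.

        The lock is granted when it is free, when the previous lease has
        expired, or when the current holder renews it.
        """
        if self.holder is None or now >= self.expires_at or self.holder == node_id:
            self.holder = node_id
            self.expires_at = now + self.lease_seconds
            return True
        return False


def active_node(lock, alive_nodes, now):
    """Every live node tries the lock; the one that succeeds is active."""
    winners = [n for n in alive_nodes if lock.try_acquire(n, now)]
    return winners[0] if winners else None


lock = LeaseLock(lease_seconds=5)
first = active_node(lock, ["a", "b", "c"], now=0)   # "a" takes the lock
renewed = active_node(lock, ["a", "b", "c"], now=3)  # "a" renews its lease
gap = active_node(lock, ["b", "c"], now=4)           # "a" died, lease not yet expired
takeover = active_node(lock, ["b", "c"], now=9)      # lease expired, "b" takes over
```

Note the window between the holder failing and the lease expiring, during which no node is active. A real lock manager has to make a similar trade-off between failover speed and false takeovers.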

If you are going to implement this yourself, there are a number of things to keep in mind:

  • Each node needs to have a consistent view of the cluster.
  • All nodes will need to inform all of the other nodes that they are online, perhaps via multicast.
  • Nodes that go offline (because of application or node failure) will need to be removed from all other nodes' "view".
  • You can then have the node with the lowest IP address (or some similar deterministic rule) be the active node.
  • If this isn't appropriate, then you will need some sort of voting exchange so the nodes can agree on who is active. Something like: http://en.wikipedia.org/wiki/Two-phase_commit_protocol
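
The first four points can be sketched roughly as follows. This is only an illustration under assumptions of mine: `ClusterView` is a hypothetical class, heartbeats are plain method calls rather than real multicast packets, timestamps are passed in explicitly, and all addresses are IPv4.

```python
import ipaddress


class ClusterView:
    """One node's view of the cluster, built from heartbeats (hypothetical)."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds of silence before a node is dropped
        self.last_seen = {}     # IP string -> timestamp of last heartbeat

    def heartbeat(self, ip, now):
        """Record that the node at `ip` announced itself at time `now`."""
        self.last_seen[ip] = now

    def members(self, now):
        """Nodes heard from within the timeout, sorted numerically by IP."""
        live = [ip for ip, seen in self.last_seen.items()
                if now - seen <= self.timeout]
        return sorted(live, key=ipaddress.ip_address)

    def active(self, now):
        """The live node with the lowest IP address, or None if none is live."""
        live = self.members(now)
        return live[0] if live else None


view = ClusterView(timeout=3)
for ip in ["10.0.0.10", "10.0.0.2", "10.0.0.7"]:
    view.heartbeat(ip, now=0)
before = view.active(now=1)        # lowest IP among all three nodes

# 10.0.0.2 stops sending heartbeats; the others keep going.
view.heartbeat("10.0.0.10", now=4)
view.heartbeat("10.0.0.7", now=4)
after = view.active(now=5)         # 10.0.0.2 has timed out of the view
```

Sorting with `ipaddress.ip_address` compares addresses numerically, so 10.0.0.2 ranks below 10.0.0.10, which a plain string sort would get wrong. In a real cluster each node keeps its own view, so this rule only yields a single active node while all views agree.
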
Gray