views:

266

answers:

3

Suppose you have n processes, n > 2. You want to have agreement amongst them that one is to be active. So they need to vote amonst each other to determine which one is active.

All processes may fail at any time, we want to have one process active if possible, but ...

We must never have two active at the same time, so if they can't be sure it is better to have no-one active. (Ie. we want to avoid split brain)

The only available communication mechanism between them is pub-sub messaging (not point to point).

One or more databases are available, but no one database should be a single point of failure. Ie. it would be very undesirabloe if all the processes were available to work, and the were prevented from doing so by the loss of a single database.

Design? What messages need to be published?

+1  A: 

I'm not clear on pub-sub messaging.

If they're getting some sort of work objects from an outside source and you only want one of them to process the work, you could take a hash value space, 2^64, divide up the space by the number of nodes each node taking a chunk. Each node could hash the work objects as they come in and determine if it's theirs.

pub-sb messaging, implying that each member can broadcast information but can't be sure that the other members see it.Anyway, your answer is actually one we had thought of, and really liked. The downside is that we actually only have 2 nodes, and so in degraded mode lose half the work.
djna
+10  A: 

Theory:

This is leader election, which is a form of the Consensus Problem, also sometimes called The Two Generals Problem. Under some sets of assumptions (fully async and messages can be lost) it's been proven impossible, and the proof is particularly elegant.

The intuition of this problem is: imagine some algorithm exists that allows consensus to be reached in some fixed number of messages. Since failures are tolerated, we can drop one message from the protocol, and it should still work. We can repeat this process until there are no messages at all, clearly an impossibility.

In practice we overcome this using failure detectors to simulate a synchronous system.

The most widely known algorithm that solves consensus is Paxos, which can tolerate failure of up to half of the participating nodes. Paxos has the reputation of being very difficult to implement as even slight misunderstandings of the details of the protocol destroy it's correctness.

Practical solutions:

While the problem in general is quite difficult, getting working systems up is far easier. There are off the shelf implementations of Paxos or equivalent algorithms available. Apache Zookeeper is the best I'm aware of. For your specific problem, I'm pretty sure it'll be your quickest route. Other Paxos implementations are around, and it also might be possible to build something on network redundancy virtual ip tools like Wackamole. I believe the high end versions of most commercial databases offer quorum features as an (expensive) option.

Also, for many applications it's acceptable to weaken correctness slightly or otherwise adjust the problem to allow much simpler solutions.

For example, if a single point of failure is tolerable because recovery is likely to be quick, then the problem is trivial: just have one special node do the work.

Another approach might be to build the system around idempotent actions, so duplicate processing becomes tolerable.

Lastly you might partition the workload into a pool of non-redundant systems: here failures will delay processing until recovery but only for items at that node, not for the entire workload.

These sorts of compromises are so much simpler that they're often a better choice. One has to weigh the utility of a full solution against the complexity of implementing it and see if there's really value. This is why so many practical systems just use 2 Phase or 3 Phase Commit, even though they block in some scenarios: the decreased availability is tolerable compared to the complexity of a full quorum system.

Jason Watkins
Thank you for this detailed explanation, and the references. I agree that relaxing the constraints to enable a simpler solution is an avenue we must explore. [I can now echo Spike Milligan's epitaph when I go back to the team "See, I told you it was hard!"].
djna
A: 

Take a look at how routing protocols (OSPF and IS-IS) do it, and see if that works for you. They elect a leader (and in the OSPF case, a backup leader).

Thomas
It's the election process itself that was bothering us. Thanks for the suggestion. This http://routergod.com/sevenofnine/ospf_part_2.html article outlines what's going on, but doesn't give much detail. The problem for me being how to deal with unreliable communitcations and avoiding the possibility of split brain.
djna
That link only explains a brief overview. Take a look at some cisco docs or the book "Routing TCP/IP".
Thomas