ansaurus

Question

What algorithms there are for failover in a distributed system?

Answer 1

A:

Tackling just a small part of your question: there's no way in the scenario you describe to decide (in the abstract) which node(s) have the latest secondary copy. At best, some node can poll and determine (after a bit of communication) who among the nodes that they know of / can see, and that know of / can see them, and that can't see the old master has the most current copy. But:

They can't find out the status of nodes they can't reach
They can't find out the status of nodes that can't reach them
They can't be sure that what they think they know about the status of a node that can see the old master when they can't is current--the master could have updated the shared neighbor after the neighbor reported status.

On the broader issues, you may want to look at how something like memcached and the like handle the issues, and especially read through the lists to see what problems they've encountered when theory met practice.

MarkusQ 2009-03-06 23:37:31

memcached does not handle these issues, it assumes that the client will do so. Moreover, it is just a caching system therefore if it fails (or returns not found), the calling process can request the data directly from source.

MarkR 2009-04-28 12:50:55

Answer 2

+6 A:

You are asking an absolutely massive question, and a lot of what you want to know is still in active research.

Some thoughts:

Distributed systems are difficult, because there are no foolproof systems to deal with failures; in an asynchronous system, there is no way to be sure that a node is down or whether there is network delay. This may sound trivial, but it really isn't.
Achieving consensus can be done by the Paxos family of algorithms, versions of which are used in Google's bigtable, and in other places.

You'll want to delve into a distributed systems textbook (or several). I like Tannenbaum's Distributed Systems: Principles and Paradigms

Rob Lachlan 2009-03-06 23:48:57

"You are asking an absolutely massive question, and a lot of what you want to know is still in active research." -- That's what makes this (hobby) project so interesting. It's no boring generic CRUD web application. Challenges are fun. :)

Esko Luontola 2009-03-07 00:15:10

No, absolutely you're right. It's an excellent idea for a project, and I'm sure you'll get a lot out of it. Definitely start reading soome textbooks. And read lamport's paper on Paxos. Very readable.

Rob Lachlan 2009-03-07 00:16:57

Answer 3

+1 A:

This problem was solved by DEC for VMS with the Distributed Lock Manager. Modern solutions are based on this design. Read the Wikipedia article for some current solutions. You should look at OCFS2, which is now part of the Linux kernel.

Jared Oberhaus 2009-03-06 23:50:23

Answer 4

+2 A:

A great blog that talks a lot about distributed systems and distributed algorithms -- including implementing Paxos -- is http://hnr.dnsalias.net/wordpress/.

Kevin Weil 2009-03-08 18:57:19

Answer 5

A:

I don't know but when you are done I want to download your distributed database system.

tuinstoel 2009-03-08 19:01:19

The project page is http://dimdwarf.sourceforge.net/ - the distributed part is at the end of the roadmap using the name "Dimdwarf-HA". It will take many months to get there, probably more than a year.

Esko Luontola 2009-03-08 22:33:41

Well good luck!!

tuinstoel 2009-03-09 10:06:48

Answer 6

+7 A:

* What algorithms there are for doing failover in a distributed system?

Possibly not algorithms, so much as systems. You need to design your architecture around the questions you've asked.

* What algorithms there are for consensus in a distributed system?

You probably want to implement Paxos. Simple Paxos is not too hard to get right. If you're are trying to make it bullet proof, read Google's 'Paxos Made Live' paper. If you're hoping to make it high-performance, look at Multi-Paxos.

* How should the nodes in the cluster determine that a node is down?

Depends. Heartbeats are actually a pretty good way to do this. The problem is that you have false positives, but that's kind of unavoidable, and in a cluster on the same LAN with manageable load they're accurate. The good thing about Paxos is that false positives are dealt with automatically. However, if you actually need failure information for some other purpose then you need to make sure it's ok that you detect a node as failed, but it actually is just under load and taking time to respond to a heartbeat.

* How should the nodes determine that what database entries had their master copy on the failed node at the time of failure, so that other nodes may recover those entries?
* How to decide that which node(s) has the latest secondary copy of some entry?
* How to decide that which node's secondary copy should be promoted to be the new master copy?

I think you might really benefit from reading the Google FileSystem paper. In GFS there's a dedicated master node which keeps track of which nodes have which blocks. This scheme might work for you, but the key is to keep accesses to this master minimal.

If you don't store this information on a dedicated node, you're going to have to store it everywhere. Try tagging the data with the master holder's id.

* How to handle it, if the node which was though to be down, suddenly comes back as if nothing happened?

See above, but the basic point is that you have to be careful because a node that is no longer the master might think that it is. One thing that I don't think you've solved: how does an update get to the master - i.e. how does a client know which node to send the update to?

* How to avoid split-brain scenarios, where the network is temporarily split into two, and both sides think that the other side has died?

Paxos works here by preventing progress in the case of a perfect split. Otherwise, as before, you have to be very careful.

In general, solve the question of knowing which node gets which data item as the master, and you'll be a long way towards fixing your architecture. Note that you can't just have the node receiving the update be the master - what if two updates happen concurrently? Don't rely on a synchronised global clock either - that way madness lies. You probably want to avoid running consensus on every write if you can help it, so instead perhaps have a slow master-failover protocol and a fast write path.

Feel free to shoot me a mail off line if you want to know more details. My blog (http://hnr.dnsalias.net/wordpress) deals with a lot of this stuff.

cheers,

Henry

HenryR 2009-03-08 23:57:36

ansaurus

tags:

views:

answers:

What algorithms there are for failover in a distributed system?

related questions