I like the LWN article "Crash-only software" and I would like to learn more about crash-safe and fault-tolerant programming.

It is surprisingly hard to ensure that persistent state stays consistent when faults occur. I am not even talking about distributed operations; this is hard on a single node, too. Even the plain Berkeley DB products (BDB Data Store or BDB Concurrent Data Store) can end up with a corrupted database if the system crashes. It is not just that high-level application constraints may be broken: the database might not open correctly at all after a crash.

What are good resources about crash-safe and fault-tolerant designs, approaches, and programming?

If the resources focus on C++ and POSIX environments, I would appreciate that.

+2  A: 

The actor model in languages such as Erlang and Scala follows the let-it-crash model. See this article.

TTMAN
+4  A: 

Akka is a framework for Java and Scala that is written with let-it-crash in mind. See this article and this presentation for an introduction to actors and let-it-crash. The approach is also called fail-fast or worker/supervisor style.
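Since the question asks about C++ and POSIX: the worker/supervisor idea can be approximated at process level with `fork` and `waitpid`. This is only a minimal sketch of the idea, not how Akka or Erlang implement supervision. The name `supervise`, the attempt counter passed to the worker, and the "restart up to N times" policy are assumptions made for the example.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cerrno>
#include <cstdlib>
#include <functional>

// Hypothetical supervisor: runs `worker` in a child process and restarts it
// when it crashes or exits with a non-zero status.
// Returns the attempt number that succeeded, or -1 after giving up.
int supervise(const std::function<int(int)>& worker, int max_attempts) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        pid_t pid = fork();
        if (pid < 0) return -1;  // could not create a worker at all

        if (pid == 0) {
            // Child: do the work. A crash here cannot corrupt the
            // supervisor's memory because it is a separate process.
            _exit(worker(attempt));
        }

        // Parent: wait for the worker, retrying if a signal interrupts us.
        int status = 0;
        pid_t r;
        do {
            r = waitpid(pid, &status, 0);
        } while (r < 0 && errno == EINTR);
        if (r < 0) return -1;

        if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            return attempt;  // worker finished cleanly
        }
        // Worker died from a signal or reported failure: let it crash,
        // then start a fresh one on the next loop iteration.
    }
    return -1;
}
```

The worker does not try to repair itself; it simply dies, and the supervisor starts a fresh one from a known state. A real supervisor would usually add a back-off delay and logging.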

Two good presentations on Erlang are Systems that Never Stop (and Erlang) and Message Passing Concurrency in Erlang.

Theron is an actor library for C++; I think there is something similar in Boost as well.

Erlang can also call C or C++ code; see this for a discussion. Java, Scala, and Akka can also call C++ code.

(If you like C++, I suggest you have a look at Scala. It is a very nice language and better than Java if you come from C++.)

Jonas Bonér's presentation Scalability, Availability & Stability Patterns is also a good one on the topic.

oluies
If you let Java (or Scala) call a C++ DLL through JNI, the stability of the JVM is endangered. Because the C++ code runs in the same process as the JVM, a crash in the C++ code takes the JVM down with it. JNI does not work very well; do not use it.
olle kullberg