I like the LWN article "Crash-only software" and I would like to learn more about crash-safe and fault-tolerant programming.
It is surprisingly hard to ensure that persistent state stays consistent when faults occur. I am not even talking about distributed operations; this is hard on a single node, too. Even plain Berkeley DB (BDB Data Store or BDB Concurrent Data Store) can end up with a corrupted database if the system crashes. The problem is not only that high-level application constraints may be broken: the database might not even open correctly after a crash.
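To illustrate how much care even a single-file update needs, here is a minimal sketch of the common "write a temporary file, `fsync`, then `rename`" pattern on POSIX. The helper names `write_all` and `atomic_replace` and the `.tmp` suffix are my own assumptions for illustration, and the actual durability guarantees depend on the file system and its mount options, so treat this as a sketch rather than a definitive recipe.

```cpp
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: write the whole buffer, retrying on short writes and EINTR.
static bool write_all(int fd, const char* data, std::size_t len) {
    while (len > 0) {
        ssize_t n = ::write(fd, data, len);
        if (n < 0) {
            if (errno == EINTR) continue;
            return false;
        }
        data += n;
        len -= static_cast<std::size_t>(n);
    }
    return true;
}

// Hypothetical helper: replace dir/name with `contents`.
// Intent: a reader (or a restart after a crash) sees either the complete old
// file or the complete new file, never a half-written one.
bool atomic_replace(const std::string& dir, const std::string& name,
                    const std::string& contents) {
    const std::string final_path = dir + "/" + name;
    const std::string tmp_path = final_path + ".tmp";  // assumed naming scheme

    // 1. Write the new contents to a temporary file in the same directory.
    int fd = ::open(tmp_path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;

    // 2. Flush the file data to stable storage before making it visible.
    bool ok = write_all(fd, contents.data(), contents.size()) && ::fsync(fd) == 0;
    if (::close(fd) != 0) ok = false;
    if (!ok) {
        ::unlink(tmp_path.c_str());
        return false;
    }

    // 3. Atomically switch the name over to the new file.
    if (::rename(tmp_path.c_str(), final_path.c_str()) != 0) {
        ::unlink(tmp_path.c_str());
        return false;
    }

    // 4. Flush the directory so the rename itself is persisted.
    int dfd = ::open(dir.c_str(), O_RDONLY);
    if (dfd < 0) return false;
    ok = (::fsync(dfd) == 0);
    ::close(dfd);
    return ok;
}
```

A call such as `atomic_replace("/var/lib/myapp", "state.dat", data)` (path and file name made up for the example) would then replace the whole file in one step. This only covers whole-file replacement; in-place updates of a larger store need something like a write-ahead log, which is exactly the kind of technique I would like to read more about.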
What are good resources on crash-safe and fault-tolerant designs, approaches, and programming?
I would especially appreciate resources that focus on C++ and POSIX environments.