When creating system services which must be highly reliable, I often end up writing a lot of 'failsafe' mechanisms for cases like: communications going down (for instance, communication with the DB), what happens if the power is lost and the service restarts, how to pick up the pieces and continue correctly (remembering that while picking up the pieces the power could go out again), and so on.

I can imagine that for not-too-complex systems, a language which catered for this would be very practical. So: a language which would remember its state at any given moment, no matter if the power gets cut off, and continues where it left off.

Does this exist yet? If so, where can I find it? If not, why can't this be realized? It would seem to me very handy for critical systems.

p.s. In case the DB connection is lost, it would signal that a problem arose and that manual intervention is needed. The moment the connection is restored, it would continue where it left off.

EDIT: Since the discussion seems to have died off, let me add a few points (while waiting until I can add a bounty to the question).

The Erlang response seems to be top rated right now. I'm aware of Erlang and have read the Pragmatic book by Armstrong (the principal creator). It's all very nice (although functional languages make my head spin with all the recursion), but the 'fault tolerant' bit doesn't come automatically. Far from it. Erlang offers a lot of supervisors and other methodologies to supervise a process and restart it if necessary. However, to properly make something which works with these structures, you need to be quite the Erlang guru, and you need to make your software fit all these frameworks. Also, if the power drops, the programmer still has to pick up the pieces and try to recover the next time the program restarts.

What I'm searching is something far simpler:

Imagine a language (as simple as PHP for instance), where you can do things like do DB queries, act on it, perform file manipulations, perform folder manipulations, etc.

Its main feature, however, should be: if the power dies and the thing restarts, it picks up where it left off (so it not only remembers where it was, it remembers the variable states as well). Also, if it stopped in the middle of a file copy, it will properly resume. And so on.

Last but not least, if the DB connection drops and can't be restored, the language just halts and signals (via syslog, perhaps) for human intervention, and then carries on where it left off.
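That halt-and-resume behaviour around a lost connection can at least be approximated today with an ordinary retry loop. A minimal sketch; `flaky_connect` is a made-up stand-in for a real DB driver's connect call, and in production a syslog handler would be attached to the logger:

```python
import logging
import time

log = logging.getLogger("service")

def wait_for_connection(connect, delay=0.01, max_tries=100):
    """Block until connect() succeeds, signalling each failure.

    connect is any zero-argument callable that raises on failure
    (a hypothetical stand-in for a real DB driver's connect call).
    """
    for attempt in range(1, max_tries + 1):
        try:
            return connect()  # restored: carry on where we left off
        except ConnectionError as exc:
            # Signal that intervention may be needed; in production a
            # syslog handler would be attached to this logger.
            log.error("connection attempt %d failed: %s", attempt, exc)
            time.sleep(delay)
    raise RuntimeError("gave up waiting for connection")

# Simulated flaky resource: fails twice, then succeeds.
state = {"tries": 0}
def flaky_connect():
    state["tries"] += 1
    if state["tries"] < 3:
        raise ConnectionError("DB unreachable")
    return "connected"

result = wait_for_connection(flaky_connect)
```

The program blocks at the point of failure and resumes at exactly that point, which is the 'pause here, fix the issue, continue' behaviour, minus the survive-a-power-cut part.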

A language like this would make a lot of service programming a lot easier.
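For comparison with what exists now: the usual workaround is explicit checkpointing, where the program persists its own variables after each step and reloads them on startup. A rough sketch (the `job.state` filename and the dict layout are arbitrary choices for illustration):

```python
import os
import pickle
import tempfile

CHECKPOINT = "job.state"

def save_checkpoint(state, path=CHECKPOINT):
    """Persist state atomically: write to a temp file, then rename.

    The rename is atomic on POSIX, so a power cut leaves either the
    old checkpoint or the new one -- never a half-written file.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # force the bytes to disk
    os.replace(tmp, path)

def load_checkpoint(path=CHECKPOINT):
    """Return the last saved state, or a fresh one on first run."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return {"next_item": 0, "done": []}

# A resumable loop: after a crash/restart, the same code picks up
# at the first unprocessed item.
items = ["a", "b", "c", "d"]
state = load_checkpoint()
for i in range(state["next_item"], len(items)):
    state["done"].append(items[i].upper())  # the actual work
    state["next_item"] = i + 1
    save_checkpoint(state)                  # checkpoint after each step
```

The hypothetical disaster-proof language would essentially do this automatically, for every variable and at every statement, instead of making the programmer choose what to save and when.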

EDIT: It seems (judging by all the comments and answers) that such a system doesn't exist, and probably won't in the foreseeable future, due to it being (nearly?) impossible to get right.

Too bad... Again, I'm not looking for this language (or framework) to get me to the moon, or to monitor someone's heart rate. But for small periodic services/tasks which always end up having loads of code handling border cases (power failure somewhere in the middle, connections dropping and not coming back up), a 'pause here, fix the issue, continue where you left off' approach would work well.

(Or a checkpoint approach, as one of the commenters pointed out (like in a video game): set a checkpoint, and if the program dies, restart from there the next time.)

Bounty awarded: At the last possible minute, when everyone was coming to the conclusion it can't be done, Stephen C came up with Napier88, which seems to have the attributes I was looking for. Although it is an experimental language, it does prove it can be done, and it is something worth investigating further.

I'll be looking at creating my own framework (with persistent state and snapshots perhaps) to add the features I'm looking for in .Net or another VM.

Everyone thanks for the input and the great insights.

+1  A: 

The closest approximation appears to be SQL. It's not really a language issue though; it's mostly a VM issue. I could imagine a Java VM with these properties; implementing it would be another matter.

A quick-and-dirty approximation is achieved by application checkpointing. You lose the "die at any moment" property, but it's pretty close.

It is the database behind SQL that provides transaction properties.
Ira Baxter
... all SQL does is signal start of Transaction and end of Transaction, and the DB handles reliable state change. That's pretty minimal support, at best.
Ira Baxter
The original question specifically asked about *language* support. All languages need implementation. SQL as a language does provide explicit support for transactions, which is way beyond what you'll get elsewhere.
You can add "start_transaction()" and "end_transaction()" calls to any language almost trivially, so the presence of these features doesn't mean the language supports fault tolerance in any interesting way. You're right, it's the "implementation", but that isn't a specific property of SQL.
Ira Baxter
I await Ira's ACID implementation of C.
@Ira: Exactly my point; the database acts as the VM for SQL.
The fault tolerance on SQL Server comes mostly through replaying transaction logs, which are SQL statements.
Chris S
It appears we have a violent agreement but that you don't agree. Yes, you can add start/end transaction as C calls. Agreed, that doesn't mean C meaningfully supports transactions. That's my point about SQL, too; it too says "start/end", but that makes it no better than C. It isn't the *language* which is providing the fault tolerance, which is what the OP requested.
Ira Baxter
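To make the point of this exchange concrete, here is a small sqlite3 example: the *language* merely marks the transaction boundaries, while the database engine does the actual atomic commit-or-rollback work. The account table is invented for illustration:

```python
import sqlite3

# In-memory database standing in for the real one.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
db.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 0)")
db.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: either both updates land, or neither."""
    try:
        with conn:  # the connection context manager issues COMMIT/ROLLBACK
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src))
            if conn.execute("SELECT balance FROM account WHERE name = ?",
                            (src,)).fetchone()[0] < 0:
                raise ValueError("insufficient funds")  # triggers rollback
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst))
        return True
    except ValueError:
        return False

ok = transfer(db, "alice", "bob", 60)      # succeeds and commits
failed = transfer(db, "alice", "bob", 60)  # would overdraw: rolled back
balances = dict(db.execute("SELECT name, balance FROM account"))
```

All the Python code does is delimit the transaction; the half-finished debit in the failed transfer vanishes because the engine, not the language, undoes it.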
+48  A: 

Erlang was designed for use in telecommunication systems, where high reliability is fundamental. I think they have a standard methodology for building sets of communicating processes in which failures can be gracefully handled.

Erlang is a concurrent functional language, well suited for distributed, highly concurrent and fault-tolerant software. An important part of Erlang is its support for failure recovery. Fault tolerance is provided by organising the processes of an Erlang application into tree structures. In these structures, parent processes monitor failures of their children and are responsible for their restart.

Ira Baxter
+1 for beating me to suggesting Erlang
Jeff and Joel had an interesting talk with Damien Katz about his use of Erlang in building CouchDB in Stack Overflow Podcast #59.
Dave Webb
I've looked at Erlang in the past. However nice its fault-tolerance features are... it doesn't survive a power cut. When the code restarts, you'll have to pick up the pieces.
@Nifle: +1 for nice summary of fault-tolerance methodology.
Ira Baxter
What counts in Erlang is the idea that threads can fail, and that a supervisory thread can hear about it and take steps to restart the work that the supervised thread was supposed to be doing. This is a key building block to constructing a fault tolerant system. In the case of power fail, the supervising thread better be in another, non-powerfailed machine, of course. If you insist on the possibility of powerfail of everything, then somewhere state has to be saved. See STM answer in this thread.
Ira Baxter
Yes, but in essence it is not disaster-proof. If something fails, you still have to manually write code which starts it up again and which has to 'recover' (clean up old files, etc.).
So although Erlang is voted highest (of course, everyone loves Erlang (me too)), it is really not what I'm searching for.
I suspect the erlang votes might coincide with its appearance on the podcast with Damia(e)n from couchdb, although Ira clearly has lots of experience in the area
Chris S
+2  A: 

Try taking an existing open source interpreted language and see if you could adapt its implementation to include some of these features. Python's default C implementation embeds an internal lock (called the GIL, Global Interpreter Lock) that is used to "handle" concurrency among Python threads by taking turns every 'n' VM instructions. Perhaps you could hook into this same mechanism to checkpoint the code state.
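A toy sketch of that idea, using CPython's `sys.settrace` hook (which fires on every executed line) to snapshot local variables; a real checkpointing scheme would persist these snapshots somewhere durable rather than collecting them in a list:

```python
import sys

snapshots = []

def tracer(frame, event, arg):
    """Record the local variables at every executed line of work() --
    roughly the hook where a checkpointing VM could persist them."""
    if event == "line" and frame.f_code.co_name == "work":
        snapshots.append(dict(frame.f_locals))
    return tracer  # keep line-tracing this frame

def work():
    total = 0
    for n in (1, 2, 3):
        total += n
    return total

sys.settrace(tracer)  # install the per-line hook
result = work()
sys.settrace(None)    # remove it again
```

This captures variable state at statement granularity, which is the easy half; resuming at the recorded point after a restart is the hard half that the interpreter would need modifying for.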

Paul McGuire
nice idea
+9  A: 

The majority of such efforts - termed 'fault tolerance' - are around the hardware, not the software.

The extreme example of this is Tandem, whose 'nonstop' machines have complete redundancy.

Implementing fault tolerance at a hardware level is attractive because a software stack is typically made from components sourced from different providers - your high-availability software application might be installed alongside some decidedly shaky other applications and services, on top of an operating system that is flaky, using hardware device drivers that are decidedly fragile.

But at a language level, almost all languages offer the facilities for proper error checking. However, even with RAII, exceptions, constraints and transactions, these code paths are rarely tested correctly, and rarely tested together in multiple-failure scenarios, and it's usually in the error-handling code that the bugs hide. So it's more about programmer understanding, discipline and trade-offs than about the languages themselves.

Which brings us back to the fault tolerance at the hardware level. If you can avoid your database link failing, you can avoid exercising the dodgy error handling code in the applications.

Sun hardware and Solaris in particular
Chris S
+11  A: 

From what I know¹, Ada is often used in safety critical (failsafe) systems.

Ada was originally targeted at embedded and real-time systems.

Notable features of Ada include: strong typing, modularity mechanisms (packages), run-time checking, parallel processing (tasks), exception handling, and generics. Ada 95 added support for object-oriented programming, including dynamic dispatch.

Ada supports run-time checks in order to protect against access to unallocated memory, buffer overflow errors, off-by-one errors, array access errors, and other detectable bugs. These checks can be disabled in the interest of runtime efficiency, but can often be compiled efficiently. It also includes facilities to help program verification.

For these reasons, Ada is widely used in critical systems, where any anomaly might lead to very serious consequences, i.e., accidental death or injury. Examples of systems where Ada is used include avionics, weapon systems (including thermonuclear weapons), and spacecraft.

N-Version programming may also give you some helpful background reading.

¹That's basically one acquaintance who writes embedded safety critical software

Chris S
Ada is *used* for critical systems, but mostly because the additional checking helps to ensure that the programmers didn't code something stupid. Ada as a language doesn't provide *any* facilities that I know about that allow an Ada thread to die and recover in a safe way.
Ira Baxter
But can it correctly withstand the loss of power? It would be great if the software just continued where it left off when the power comes back on.
As I said, Ada provides no support for state storage/recovery. So *Ada* can't handle a power loss. Of course, the Ada application might be coded to use a transactional memory library (e.g., a special database etc) but then it isn't the language doing the work, which is what OP requested.
Ira Baxter
Ada is more about fault prevention than fault tolerance; however, most safety-critical applications require full fault tolerance. There are a tonne of papers on Ada and fault tolerance out there; my knowledge is third-hand, however.
Chris S and may help @reiner
Chris S
The Ariane 5 disaster was due to a bug in Ada code. Nothing is truly "disaster proof."
Ken Liu
The Ariane 5 disaster was due to stupid project management. They decided that some certain code worked fine on Ariane 4 (which had a much lower initial acceleration), so they wouldn't even look at it before using it on the Ariane 5. No language is immune to specs and therefore inputs changing out from under it.
David Thornley
+2  A: 

For a program to continue where it left off if the machine loses power, not only would it need to save state to somewhere, the OS would also have to "know" to resume it.

I suppose implementing a "hibernate" feature in a language could be done, but having that happen constantly in the background so it's ready in the event anything bad happens sounds like the OS' job, in my opinion.

Well, in my example, just imagine that the code is run as a service. That way, if the machine reboots, the service is restarted automatically. In an ideal world, this disaster-proof language would just continue at the exact spot where it saved its state the last time.
OK - but doesn't that still involve saving state and acting as its own "mini-OS" for local hibernation? My question is how much storage overhead this will take on top of what the OS is already doing (because if the OS is also saving state, then you're double-storing everything).
No, the program just saves state wherever it is: "OK, I'm now at line 50, the variables I know are this, that and such. Let's continue." I know, very oversimplified, but I'm not seeing why this shouldn't be possible (and therefore I hoped it already existed somewhere).
@reiner You just can't assume that it's ok to continue in all cases. The state of the underlying data in the database may have been altered by another process in the time it took for the "first" process to resume from its halted state.
cynthia: but this could also have been the case if the program executed really slowly. I'm not asking for atomicity, just for continuing where the program left off.
Your proposed solution makes too many assumptions to be generalizable enough to make a language out of. Many people have proposed hardware solutions which will handle server power cycles and such with aplomb but languages are too abstracted from the metal to deal with these sorts of things by themselves.
+2  A: 

Its main feature, however, should be: if the power dies and the thing restarts, it picks up where it left off (so it not only remembers where it was, it remembers the variable states as well). Also, if it stopped in the middle of a file copy, it will properly resume. And so on.

... ...

I've looked at Erlang in the past. However nice its fault-tolerance features are... it doesn't survive a power cut. When the code restarts, you'll have to pick up the pieces.

If such a technology existed, I'd be VERY interested in reading about it. That said, The Erlang solution would be having multiple nodes--ideally in different locations--so that if one location went down, the other nodes could pick up the slack. If all of your nodes were in the same location and on the same power source (not a very good idea for distributed systems), then you'd be out of luck as you mentioned in a comment follow-up.

Dr. Watson
Still, even with the Erlang solution, the other node would have to magically resume where the first one left off. However, in Erlang it doesn't work this way. The other node just gets a message that a node has died, and this node then has to restart the first node. So you're still left with loads of code which has to restart the process (instead of resuming at the exact spot).
Unless the memory underneath the power-failed process is nonvolatile, you can't restart at the exact spot; you don't have it after the power fail. In this case, the best you can hope for is to restart a process at some place where its state has been saved, e.g., the beginning of a transaction. See my answer on STM in this thread.
Ira Baxter
Unless the language caters for this even when nonvolatile memory doesn't exist (for instance, by storing everything on disk).
@reiner I think you hit it with your first comment with the word 'magically'. You are looking for the proverbial silver bullet. Real fault tolerance happens in overlapping layers and is far from perfect. That's why highly fault tolerant systems are so expensive and why they are __systems__.
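For what it's worth, the resume-a-file-copy part of the question needs no special language at all: the bytes already written to the destination act as the implicit checkpoint. A rough sketch (it assumes the existing destination prefix is valid, which holds if each chunk is fsynced in order):

```python
import os

def resumable_copy(src, dst, chunk=64 * 1024):
    """Copy src to dst, continuing from wherever a previous
    (interrupted) run stopped, using the bytes already present
    in dst as the checkpoint."""
    done = os.path.getsize(dst) if os.path.exists(dst) else 0
    with open(src, "rb") as fin, open(dst, "ab") as fout:
        fin.seek(done)  # skip what an earlier run already copied
        while True:
            block = fin.read(chunk)
            if not block:
                break
            fout.write(block)
            fout.flush()
            os.fsync(fout.fileno())  # each chunk is durable before the next
    return os.path.getsize(dst)

# Simulate a copy that died after three bytes, then a restart.
with open("src.bin", "wb") as f:
    f.write(b"abcdefghij")
with open("dst.bin", "wb") as f:
    f.write(b"abc")  # pretend the first run was cut off here

size = resumable_copy("src.bin", "dst.bin")
```

Generalising this trick to arbitrary program state, rather than a single append-only file, is exactly where it stops being easy.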
+2  A: 

The Microsoft Robotics Group has introduced a set of libraries that appear to be applicable to your question.

What is Concurrency and Coordination Runtime (CCR)?

Concurrency and Coordination Runtime (CCR) provides a highly concurrent programming model based on message-passing with powerful orchestration primitives enabling coordination of data and work without the use of manual threading, locks, semaphores, etc. CCR addresses the need of multi-core and concurrent applications by providing a programming model that facilitates managing asynchronous operations, dealing with concurrency, exploiting parallel hardware and handling partial failure.

What is Decentralized Software Services (DSS)?

Decentralized Software Services (DSS) provides a lightweight, state-oriented service model that combines representational state transfer (REST) with a formalized composition and event notification architecture enabling a system-level approach to building applications. In DSS, services are exposed as resources which are accessible both programmatically and for UI manipulation. By integrating service composition, structured state manipulation, and event notification with data isolation, DSS provides a uniform model for writing highly observable, loosely coupled applications running on a single node or across the network.

Most of the answers given are general purpose languages. You may want to look into more specialized languages that are used in embedded devices. The robot is a good example to think about. What would you want and/or expect a robot to do when it recovered from a power failure?

John Asbeck
I know CCR quite well, and it is a nice library that offers a simpler way to deal with concurrency. However, it doesn't have much to do with offering a way to make your program disaster-proof.
+2  A: 

In the embedded world, this can be implemented through a watchdog interrupt and a battery-backed RAM. I've written such myself.

Paul Nathan
Nice, but this only detects your program has stopped for some reason or the other. Your software will still have to figure out where it left off, remove erroneous parts, and continue.
Well, actually the program counter is part of what's stored. :-) It might have to do some initial electrical configuration of the ports. Essentially you need an OS function, not a language function.
Paul Nathan
+25  A: 

Software Transactional Memory (STM) combined with nonvolatile RAM would probably satisfy the OP's revised question.

STM is a technique for implementing "transactions", e.g., sets of actions that are done effectively as an atomic operation, or not at all. Normally the purpose of STM is to enable highly parallel programs to interact over shared resources in a way which is easier to understand than traditional lock-that-resource programming, and has arguably lower overhead by virtue of having a highly optimistic lock-free style of programming.

The fundamental idea is simple: all reads and writes inside a "transaction" block are recorded (somehow!); if any two threads conflict on these sets (read-write or write-write conflicts) at the end of either of their transactions, one is chosen as the winner and proceeds, and the other is forced to roll back its state to the beginning of the transaction and re-execute.

If one insisted that all computations were transactions, and the state at the beginning(/end) of each transaction was stored in nonvolatile RAM (NVRAM), a power fail could be treated as a transaction failure resulting in a "rollback". Computations would proceed only from transacted states in a reliable way. NVRAM these days can be implemented with Flash memory or with battery backup. One might need a LOT of NVRAM, as programs have a lot of state (see minicomputer story at end). Alternatively, committed state changes could be written to log files that were written to disk; this is the standard method used by most databases and by reliable filesystems.
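The commit-or-roll-back-to-last-transacted-state idea can be sketched in a few lines with an in-memory toy store (purely illustrative: real STM tracks read/write sets across threads, and real durability needs NVRAM or a log, as described above):

```python
import copy

class TransactionFailed(Exception):
    """Stands in for any failure mid-transaction (e.g., power cut)."""

class Store:
    """Toy transactional store: mutations become visible only on
    commit; any exception inside the block rolls everything back."""
    def __init__(self, **data):
        self.committed = dict(data)  # the durable, 'transacted' state
        self.working = None

    def __enter__(self):
        self.working = copy.deepcopy(self.committed)  # private snapshot
        return self.working

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.committed = self.working  # commit atomically
        self.working = None                # otherwise: discard the snapshot
        return exc_type is TransactionFailed  # swallow simulated failures

store = Store(balance=100)

with store as s:  # successful transaction: committed
    s["balance"] -= 30

with store as s:  # failed transaction: rolled back
    s["balance"] -= 1000
    raise TransactionFailed("simulated power cut mid-transaction")
```

After both blocks run, the committed balance reflects only the first transaction; the half-finished second one leaves no trace, which is the property a restart would rely on.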

The current question with STM is, how expensive is it to keep track of the potential transaction conflicts? If implementing STM slows the machine down by an appreciable amount, people will live with existing slightly unreliable schemes rather than give up that performance. So far the story isn't good, but then the research is early.

People haven't generally designed languages for STM; for research purposes, they've mostly enhanced Java with STM (see Communications of the ACM article in June? of this year). I hear MS has an experimental version of C#. Intel has an experimental version for C and C++. The Wikipedia page has a long list. And the functional programming guys are, as usual, claiming that the side-effect-free property of functional programs makes STM relatively trivial to implement in functional languages.

If I recall correctly, back in the 70s there was considerable early work on distributed operating systems, in which processes (code+state) could travel trivially from machine to machine. I believe several such systems explicitly allowed node failure and could restart a process in a failed node from saved state in another node. Early key work was on the Distributed Computing System by Dave Farber. Because designing languages back in the 70s was popular, I recall DCS had its own programming language, but I don't remember the name. If DCS didn't allow node failure and restart, I'm fairly sure the follow-on research systems did.

EDIT: A 1996 system which appears on first glance to have the properties you desire is documented here. Its concept of atomic transactions is consistent with the ideas behind STM. (Goes to prove there isn't a lot new under the sun).

A side note: Back in the 70s, core memory was still king. Core, being magnetic, was nonvolatile across power fails, and many minicomputers (and I'm sure the mainframes) had power-fail interrupts that notified the software some milliseconds ahead of loss of power. Using that, one could easily store the register state of the machine and shut it down completely. When power was restored, control would return to a state-restoring point, and the software could proceed. Many programs could thus survive power blinks and reliably restart. I personally built a time-sharing system on a Data General Nova minicomputer; you could actually have it running 16 teletypes full blast, take a power hit, and come back up and restart all the teletypes as if nothing happened. The change from cacophony to silence and back was stunning; I know, because I had to repeat it many times to debug the power-failure management code, and it of course made a great demo (yank the plug, deathly silence, plug back in...). The name of the language that did this was, of course, Assembler :-}

Ira Baxter
Very interesting... let me read more on the subjects and topics you mentioned. A hardware/software solution I'm not interested in; it should be software only. And if it is a factor X slower, I don't care. Computers are really fast anyway.
OK, then what you want is something like STM with log files being written to disk. On powerup, the log files get reprocessed to put the state into some condition consistent with the transactions, and to restart the computations that were running at the moment of failure.
Ira Baxter
Exactly. I'm thinking of following up one of the commenters to add it into a VM myself. Lua looks like a good candidate. Still, strange no-one thought of this earlier. I think it would be very useful (for not too complex 'scripts' or programs which need to fulfill one or more tasks, but need to fulfill it from a-z, in a really trustworthy manner.
I'll swear I saw a commercial product with this idea a few years ago, from a little company in New Zealand. I dug around the web trying to recover the name of the place but didn't succeed. Frankly, I think there are very few new ideas; people are really smart. There are very few commercially successful new ideas, though. Classically, what goes wrong with new languages is: "Wow, that's neat. But I can't retrain my assembler/COBOL/FORTRAN/C/Java/C# programmers, so I'll keep looking for your widget in my language." (I have my own language-design experience with a parallel programming language.)
Ira Baxter
I understand what you are saying. I would be happy with a 'widget', framework or extension to C# as well, if it would add this functionality. ;^)
So STM is a hardware solution to transactions in the language level with checkpoints, commits and rollbacks... like SQL, with it's built-in transactions then right? ;-)
STM is a pure "software" solution, with commits and rollbacks. There are hardware versions which attempt to lower the overhead of STM; the Rock processor that Sun will now famously not produce was an example of a hardware-supported transaction system. If your point is that SQL doesn't have the transactionality built into the language, I'll agree; most of the machinery is behind the scenes. However, unlike SQL, which simply makes DB calls to "start/end transaction", a compiler for STM does a huge amount of work to implement it. So the compiler for the language really does implement transactions.
Ira Baxter
@reiner: while MS has some experimental STM support for C#, I don't think it handles power failures. You personally still have some work to do to achieve your goal :-{
Ira Baxter
I read about the stm support for c#. Looks nice though for the optimistic concurrency approach. I can imagine that this would be very fast. I'll sure be using that someday. Maybe in combination with the CCR (mentioned in a different reply).
I know that in Lua it's fairly trivial to access the program variables and to save state. With some modifications to the VM I could get a long way. I would prefer to build it into C#, though, but getting to the VM level is probably out of the question. I was thinking about a software framework which would add stateful variables in C#, but I'm not close to a working solution yet ;^)
For C# this might help though you'd be checkpointing often and making the app grind
Chris S
Excellent advice. Looks like a tricky MSDN article they refer to, though. Anyhow, I'll be sure to read it. Thanks!
+13  A: 

I doubt that the language features you are describing are possible to achieve.

And the reason for that is that it would be very hard to define common and general failure modes and how to recover from them. Think for a second about your sample application - some website with some logic and database access. And let's say we have a language that can detect power shutdown and subsequent restart, and somehow recover from it. The problem is that it is impossible for the language to know how to recover.

Let's say your app is an online blog application. In that case it might be enough to just continue from the point where we failed, and all would be OK. However, consider a similar scenario for an online bank. Suddenly it's no longer smart to just continue from the same point. For example, if I was trying to withdraw some money from my account, and the computer died right after the checks but before it performed the withdrawal, and it then came back one week later, it would give me the money even though my account is in the negative now.

In other words, there is no single correct recovery strategy, so this is not something that can be implemented into the language. What language can do is to tell you when something bad happens - but most languages already support that with exception handling mechanisms. The rest is up to application designers to think about.

There are a lot of technologies that allow designing fault tolerant applications. Database transactions, durable message queues, clustering, hardware hot swapping and so on and on. But it all depends on concrete requirements and how much the end user is willing to pay for it all.

Gregory Mostizky
For a lot of occasions it would surely help a lot. Of course, maybe for a bank you'd need something better/more intelligent, but there are lots of scripts/services which would really be much simpler if recovery mechanisms were built into the language. Even if it were just to have restore points (like in a video game): if you die, you continue from the restore point. In a program this could be identical.
+9  A: 

No, a disaster-proof language does not exist.


Disaster-proof implies perfection. It brings to mind images of a process which applies some intelligence to resolve unknown, unspecified and unexpected conditions in a logical manner. There is no manner by which a programming language can do this. If you, as the programmer, can not figure out how your program is going to fail and how to recover from it then your program isn't going to be able to do so either.

Disaster from an IT perspective can arise in so many fashions that no one process can resolve all of those different issues. The idea that you could design a language to address all of the ways in which something could go wrong is just incorrect. Due to the abstraction from the hardware many problems don't even make much sense to address with a programming language; yet they are still 'disasters'.

Of course, once you start limiting the scope of the problem; then we can begin talking about developing a solution to it. So, when we stop talking about being disaster-proof and start speaking about recovering from unexpected power surges it becomes much easier to develop a programming language to address that concern even when, perhaps, it doesn't make much sense to handle that issue at such a high level of the stack. However, I will venture a prediction that once you scope this down to realistic implementations it becomes uninteresting as a language since it has become so specific. i.e. Use my scripting language to run batch processes overnight that will recover from unexpected power surges and lost network connections (with some human assistance); this is not a compelling business case to my mind.

Please don't misunderstand me. There are some excellent suggestions within this thread but to my mind they do not rise to anything even remotely approaching disaster-proof.

Hahaha... Seriously though, might it be, that your wisdom (like mine) is not infinite and someone else might still come up with one? This seems reasonable, which is why I don't understand the upvote of the answer. ;^)
I'm not joking; it's not reasonable. It is one thing to look for a language that promotes recoverability and reliability. (see Erlang answers) But it's completely another to look for a language that just does it for you. Life is too complex for stuff like this to work. What is appropriate for a medical imaging device and a air-traffic tower and amazon's database are all completely different. The failure modes and the ways to recover are all different; so there is no one language that can do it all. Even in simple programs the error-handling routines are complex.
For example, I maintain a Point-of-Sale system for my company and the number of ways that thing can fail is incredible; and that's just a POS. No one dies when their sale doesn't go through.
True... I can imagine my language may not be the correct solution for all areas (especially ones with high complexity, real-time characteristics, etc). Yet for simple, almost script-like things I could see it working rather well. But maybe I'm just too optimistic.
By the time you reduce the scope enough to make this work, you won't have a generalizable solution anymore. (It won't be a language.) You will simply have a well-designed program for your application. At best, you will have a nice tool that can be used by many programmers to perform a common but necessary task. That's not bad; but it is also not a disaster-proof language.
I'm no language designer, so my solution will always be highly tailored to my situation. I was hoping academic minds would have solved this problem ages ago.
It would be nice to add more explanation about why this is the case.
I agree with Cynthia and have provided an answer below that references said "academic minds". Hopefully, you'll find solace there.
+4  A: 

Consider a system built from non-volatile memory. The program state is persisted at all times, and should the processor stop for any length of time, it will resume at the point it left when it restarts. Therefore, your program is 'disaster proof' to the extent that it can survive a power failure.

This is entirely possible, as other posts have outlined when talking about Software Transactional Memory, fault tolerance, etc. Curiously, nobody has mentioned 'memristors'; they would offer a future architecture with these properties, and perhaps one that is not a completely von Neumann architecture either.

Now imagine a system built from two such discrete systems - for a straightforward illustration, one is a database server and the other an application server for an online banking website.

Should one pause, what does the other do? How does it handle the sudden unavailability of its co-worker?

It could be handled at the language level, but that would mean lots of error handling and such, and that's tricky code to get right. That's pretty much no better than where we are today, where machines are not checkpointed but the languages try to detect problems and ask the programmer to deal with them.

It could pause too - at the hardware level they could be tied together, such that from a power perspective they are one system. But that's hardly a good idea; better availability would come from a fault-tolerant architecture with backup systems and such.

Or we could use persistent message queues between the two machines. However, at some point these messages get processed, and by then they could be too old! Only application logic can really work out what to do in those circumstances, and there we are back to languages delegating to the programmer again.
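A persistent queue only pushes the staleness decision into application logic, as noted above. A minimal Python sketch of the idea (the file path and freshness window are invented for illustration): messages survive a restart because they are fsync'd to a journal, but the consumer still has to decide what a too-old message means.

```python
import json
import os
import tempfile
import time

QUEUE_PATH = os.path.join(tempfile.gettempdir(), "demo_queue.jsonl")  # invented path
MAX_AGE_SECONDS = 60.0  # application-specific freshness window (invented)

def enqueue(payload, now=None):
    """Append a timestamped message; fsync so it survives a crash/restart."""
    record = {"sent_at": time.time() if now is None else now, "payload": payload}
    with open(QUEUE_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())

def drain(handler, now=None):
    """Replay surviving messages. Each is flagged stale or not -- only the
    application-supplied handler can decide what a too-old message means."""
    now = time.time() if now is None else now
    if not os.path.exists(QUEUE_PATH):
        return
    with open(QUEUE_PATH) as f:
        for line in f:
            record = json.loads(line)
            handler(record["payload"], now - record["sent_at"] > MAX_AGE_SECONDS)
    os.remove(QUEUE_PATH)
```

The queue itself is trivially durable; the `stale` flag handed to the handler is exactly the part no language can decide for you.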

So it seems that the disaster-proofing is better in the current form - uninterrupted power supplies, hot backup servers ready to go, multiple network routes between hosts, etc. And then we only have to hope that our software is bug-free!

true... I can see problems for things like banking applications. But for 'simple' tasks which have to process CSV files delivered to an FTP site, perform hourly maintenance on files or the DB, etc etc. In that case it would be nice if it just continued where it left off.
Anyhow... you touch on some interesting points. It is definitely not easy to create such a language/system/framework.
would the simple tasks really need much beyond a UPS and 'retry until succeeding' logic in the upload script?
I bump into a lot of these cases, going out of my way to keep temp files and such to remember where I was in case of unforeseen problems. I also have code which, by reading that file, can deduce where I was and what has to be cleaned up or continued
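The temp-file bookkeeping described in that comment boils down to a small reusable pattern. A Python sketch (paths and helper names are invented; this is an illustration, not a framework): record progress durably after each item, and on restart skip what was already done.

```python
import json
import os
import tempfile

CHECKPOINT = os.path.join(tempfile.gettempdir(), "csv_job.checkpoint")  # invented path

def process_all(items, work, checkpoint_path=CHECKPOINT):
    """Process items in order, durably recording progress after each one,
    so a restart (power cut, crash) resumes at the first unfinished item."""
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["done"]
    for i in range(done, len(items)):
        work(items[i])  # the unit of work; should itself be atomic/idempotent
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"done": i + 1}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, checkpoint_path)  # atomic rename: old state or new, never garbage
    os.remove(checkpoint_path)  # finished cleanly; next run starts fresh
```

The write-temp-then-rename step matters: a power cut mid-write leaves either the old checkpoint or the new one, never a half-written file.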
+2  A: 

Depending upon your definition of a disaster, it can range from 'difficult' to 'practically impossible' to delegate this responsibility to the language.

Other examples given include persisting the current state of the application to NVRAM after each statement is executed. This only works as long as the computer doesn't get destroyed.

How would a language level feature know to restart the application on a new host?

And in the situation of restoring the application to a host - what if significant time had passed and assumptions/checks made previously were now invalid?

T-SQL, PL/SQL and other transactional languages are probably as close as you'll get to 'disaster proof': they either succeed (and the data is saved), or they don't. Short of disabling transactional isolation, it's difficult (though probably not impossible if you really try hard) to get into 'unknown' states.

You can use techniques like SQL Mirroring to ensure that writes are saved in at least two locations concurrently before a transaction is committed.

You still need to ensure you save your state every time it's safe (commit).
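The "commit every time it's safe" discipline can be sketched with SQLite's transaction support (table and function names are invented for the example): a failed step rolls the whole unit back, so the data is never left in an unknown state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real server would live elsewhere
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: either both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            (balance,) = conn.execute("SELECT balance FROM account WHERE name = ?",
                                      (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")  # forces the rollback
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False
```

A transfer that would overdraw the account raises mid-transaction and leaves both balances exactly as they were: the 'succeed or don't' property the answer describes.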

Will Hughes
+1  A: 

Precise answer:

Ada and SPARK were designed for maximum fault tolerance and to move as many bugs as possible to compile time rather than runtime. Ada was commissioned by the US Department of Defense for military and aviation systems, running on embedded devices in such things as airplanes. SPARK is its descendant. There is another language, HAL/S, used in the early US space program, geared to handling HARDWARE failure and memory corruption due to cosmic rays.

Practical answer:

I've never met anyone who can code Ada/SPARK. For most users, the best answer is a SQL variant on a DBMS with automatic failover and clustering of servers. Integrity checks guarantee safety. Something like T-SQL or PL/SQL has full transactional security, is Turing-complete, and is pretty tolerant of problems.

Reason there isn't a better answer:

For performance reasons, you can't provide durability for every program operation. If you did, processing would slow to the speed of your fastest nonvolatile storage. At best, your performance would drop by a thousand- or million-fold, because of how much slower ANYTHING is than CPU caches or RAM.

It would be the equivalent of going from a Core 2 Duo CPU to the ancient 8086 CPU -- at most you could do a couple hundred operations per second. Except, this would be even SLOWER.
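To make that cost concrete, here is a toy Python sketch (the file name is invented) of the "durability for every operation" regime: each increment waits for an fsync before continuing. On real hardware the durable variant typically runs orders of magnitude slower than the in-memory one, which is exactly the slowdown described above.

```python
import os
import tempfile

def run_counter(n, durable):
    """Increment a counter n times. If durable, force each new value to
    stable storage before continuing -- durability for every operation."""
    path = os.path.join(tempfile.gettempdir(), "counter.state")  # invented path
    value = 0
    with open(path, "w") as f:
        for _ in range(n):
            value += 1
            if durable:
                f.seek(0)
                f.write(str(value))
                f.flush()
                os.fsync(f.fileno())  # wait for the disk: this is the cost
    return value
```

Timing the two calls (e.g. with `time.perf_counter`) shows the durable variant bounded by storage latency rather than CPU speed.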

In cases where frequent power cycling or hardware failures exist, you use something like a DBMS, which guarantees ACID for every important operation. Or, you use hardware that has fast, nonvolatile storage (flash, for example) -- this is still much slower, but if the processing is simple, this is OK.

At best your language gives you good compile-time safety checks for bugs, and will throw exceptions rather than crashing. Exception handling is a feature of half the languages in use now.

I would be the first to create script-like programs in T-SQL. It's all very nice for DB thingies, but for anything else it is worthless. I'm looking more for a general-purpose language which also has DB support
Separate your durability from your logic: your program logic should be in a language with exception handling and good compile-time checks. Major operations can be logged to the DB to provide integrity. Unfortunately, there's no better way to do it.
FastCompany ran an article a couple of years back on the programming for the Space Shuttle, and mentioned that they're still using HAL/S... but that it was the process, not the language, that was the key to preparing for problems.

I think it's a fundamental mistake for recovery not to be a salient design issue. Punting responsibility exclusively to the environment leads to a generally brittle solution, intolerant of internal faults.

If it were me, I would invest in reliable hardware AND design the software in a way that it was able to recover automatically from any possible condition. Per your example, database session maintenance should be handled automatically by a sufficiently high-level API. If you have to manually reconnect, you are likely using the wrong API.

As others have pointed out, procedural languages embedded in modern RDBMSes are the best you are going to get without the use of an exotic language.

VMs in general are designed for this sort of thing. You could use a VM vendor's (VMware et al.) API to control periodic checkpointing within your application as appropriate.

VMware in particular has a replay feature (Enhanced Execution Record) which records EVERYTHING and allows point-in-time playback. Obviously there is a massive performance hit with this approach, but it would meet the requirements. I would just make sure your disk drives have a battery-backed write cache.

You would most likely be able to find similar solutions for Java bytecode run inside a Java virtual machine. Google "fault tolerant JVM" and "virtual machine checkpointing".


If you do want the program information saved, where would you save it?

It would need to be saved, e.g., to disk. But this wouldn't help you if the disk failed, so already it's not disaster-proof.

You are only going to get a certain level of granularity in your saved state. If you want something like this, then probably the best approach is to define your granularity level in terms of what constitutes an atomic operation, and save state to the database before each atomic operation. Then you can restore to the point of the last completed atomic operation.
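That granularity idea can be sketched in a few lines of Python with SQLite standing in for "the database" (the job name and schema are invented): the step counter is committed after each atomic operation, and a restart resumes from the last committed step.

```python
import sqlite3

def make_store():
    """Progress store; a file-backed database would survive real restarts."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE IF NOT EXISTS progress (job TEXT PRIMARY KEY, step INTEGER)")
    return db

def run_job(db, job, steps):
    """Run steps in order; commit the step number after each atomic
    operation, so after a crash we restore to the last completed one."""
    row = db.execute("SELECT step FROM progress WHERE job = ?", (job,)).fetchone()
    for i in range(row[0] if row else 0, len(steps)):
        steps[i]()  # the atomic operation itself
        db.execute("INSERT OR REPLACE INTO progress VALUES (?, ?)", (job, i + 1))
        db.commit()
```

The trade-off the answer mentions is visible here: committing after every step is what makes the restart exact, and also what makes it slow.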

I don't know of any language that would do this automatically, since the cost of saving state to secondary storage is extremely high. Therefore, there is a trade-off between level of granularity and efficiency, which would be hard to define for an arbitrary application.

Larry Watanabe
+2  A: 

If I understand your question correctly, I think that you are asking whether it's possible to guarantee that a particular algorithm (that is, a program plus any recovery options provided by the environment) will complete (after any arbitrary number of recoveries/restarts).

If this is correct, then I would refer you to the halting problem:

Given a description of a program and a finite input, decide whether the program finishes running or will run forever, given that input.

I think that classifying your question as an instance of the halting problem is fair considering that you would ideally like the language to be "disaster proof" -- that is, imparting a "perfectness" to any flawed program or chaotic environment.

This classification reduces any combination of environment, language, and program down to "program and a finite input".

If you agree with me, then you'll be disappointed to read that the halting problem is undecidable. Therefore, no "disaster proof" language or compiler or environment could be proven to be so.

However, it is entirely reasonable to design a language that provides recovery options for various common problems.

It's all a matter of *how much* disaster proofing you can apply.
I agree completely, but in another comment the submitter asked for a computer science-y solution. Otherwise, he/she would have assigned the bounty to one of the other 16 fine, pragmatic, and well-written solutions.
  • First, implement a fault-tolerant application: one where, if you have 8 features and 5 failure modes, you have done the analysis and testing to demonstrate that all 40 combinations work as intended (and as desired by the specific customer: no two will likely agree).
  • Second, add a scripting language on top of the supported set of fault-tolerant features. It needs to be as near to stateless as possible, so almost certainly something non-Turing-complete.
  • Finally, work out how to handle restoration and repair of scripting-language state, adapted to each failure mode.

And yes, this is pretty much rocket science.

+2  A: 

In the case of power failure, it sounds to me like: "When your only tool is a hammer, every problem looks like a nail."

You don't solve power failure problems within a program. You solve this problem with backup power supplies, batteries, etc.

This is true - if Reinier's only problem is power failure. But I believe that the larger concept behind his question is very valid.

Windows Workflow Foundation may solve your problem. It's .NET-based and is designed graphically as a workflow with states and actions.

It allows for persistence to the database (either automatically or when prompted). You could do this between states/actions. This serialises the entire instance of your workflow into the database. It will be rehydrated and execution will continue when any of a number of conditions is met (a certain time, rehydrated programmatically, an event fires, etc...)

When a WWF host starts, it checks the persistence DB and rehydrates any workflows stored there. It then continues to execute from the point of persistence.

Even if you don't want to use the workflow aspects, you can probably still just use the persistence service.

As long as your steps are atomic this should be sufficient - especially since I'm guessing you have a UPS, so you could monitor for UPS events and force persistence if a power issue is detected.
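The persist/rehydrate cycle described above can be imitated in miniature. The sketch below is hand-rolled Python, not the WF API; the class and file path are invented. The whole workflow instance is serialised between states and later rehydrated to continue from the point of persistence.

```python
import os
import pickle
import tempfile

STORE = os.path.join(tempfile.gettempdir(), "workflow.bin")  # invented path

class OrderWorkflow:
    """Tiny state machine whose entire instance is persisted between states
    and rehydrated later -- a hand-rolled imitation of WF-style persistence."""
    TRANSITIONS = {"received": "charged", "charged": "shipped", "shipped": "done"}

    def __init__(self, order_id):
        self.order_id = order_id
        self.state = "received"

    def advance(self):
        self.state = self.TRANSITIONS[self.state]

def persist(workflow):
    with open(STORE, "wb") as f:
        pickle.dump(workflow, f)  # serialise the whole instance

def rehydrate():
    with open(STORE, "rb") as f:
        return pickle.load(f)  # continue from the point of persistence
```

A host that dies after `persist` loses nothing: the next host calls `rehydrate` and carries on from the saved state.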

interesting! This might actually be the very first response in favour of proving that something like this does exist, or is even possible ;^)
+2  A: 

There are several commercially available frameworks - Veritas, Sun's HA, IBM's HACMP, etc. - which will automatically monitor processes and start them on another server in the event of failure.

There is also expensive hardware like HP's Tandem NonStop range, which can survive internal hardware failures.

However, software is built by people, and people love to get it wrong. Consider the cautionary tale of the IEFBR14 program shipped with IBM's MVS. It is basically a NOP dummy program which allows the declarative bits of JCL to happen without really running a program. This is the entire original source code:

             BR    14       Return addr in R14 -- branch to it

Nothing could be simpler? During its long life this program has actually accumulated a bug report and is now on version 4.

That's one bug for three lines of code, and the current version is four times the size of the original.

Errors will always creep in, just make sure you can recover from them.

James Anderson
That is a wonderful read on Wikipedia, James. Thanks for the information.
Paul Nathan
@Paul,@James: Where's the Wikipedia link?
Ira Baxter

If I were going about solving your problem, I would write a daemon (probably in C) that does all database interaction in transactions, so that no bad data gets inserted if it is interrupted. Then have the system start this daemon at startup.

Obviously, developing web stuff in C is quite a bit slower than doing it in a scripting language, but it will perform better and be more stable (if you write good code, of course :).

Realistically, I'd write it in Ruby (or PHP or whatever) and have something like Delayed Job (or cron or whatever scheduler) run it every so often, because I wouldn't need stuff updating every clock cycle.

Hope that makes sense.

Sam Soffes
+1  A: 

This question forced me to post this text

(It's quoted from HGTTG by Douglas Adams:)

Click, hum.

The huge grey Grebulon reconnaissance ship moved silently through the black void. It was travelling at fabulous, breathtaking speed, yet appeared, against the glimmering background of a billion distant stars to be moving not at all. It was just one dark speck frozen against an infinite granularity of brilliant night.

On board the ship, everything was as it had been for millennia, deeply dark and silent.

Click, hum.

At least, almost everything.

Click, click, hum.

Click, hum, click, hum, click, hum.

Click, click, click, click, click, hum.


A low level supervising program woke up a slightly higher level supervising program deep in the ship's semi-somnolent cyberbrain and reported to it that whenever it went click all it got was a hum.

The higher level supervising program asked it what it was supposed to get, and the low level supervising program said that it couldn't remember exactly, but thought it was probably more of a sort of distant satisfied sigh, wasn't it? It didn't know what this hum was. Click, hum, click, hum. That was all it was getting.

The higher level supervising program considered this and didn't like it. It asked the low level supervising program what exactly it was supervising and the low level supervising program said it couldn't remember that either, just that it was something that was meant to go click, sigh every ten years or so, which usually happened without fail. It had tried to consult its error look-up table but couldn't find it, which was why it had alerted the higher level supervising program to the problem .

The higher level supervising program went to consult one of its own look-up tables to find out what the low level supervising program was meant to be supervising.

It couldn't find the look-up table .


It looked again. All it got was an error message. It tried to look up the error message in its error message look-up table and couldn't find that either. It allowed a couple of nanoseconds to go by while it went through all this again. Then it woke up its sector function supervisor.

The sector function supervisor hit immediate problems. It called its supervising agent which hit problems too. Within a few millionths of a second virtual circuits that had lain dormant, some for years, some for centuries, were flaring into life throughout the ship. Something, somewhere, had gone terribly wrong, but none of the supervising programs could tell what it was. At every level, vital instructions were missing, and the instructions about what to do in the event of discovering that vital instructions were missing, were also missing.

Small modules of software — agents — surged through the logical pathways, grouping, consulting, re-grouping. They quickly established that the ship's memory, all the way back to its central mission module, was in tatters. No amount of interrogation could determine what it was that had happened. Even the central mission module itself seemed to be damaged.

This made the whole problem very simple to deal with. Replace the central mission module. There was another one, a backup, an exact duplicate of the original. It had to be physically replaced because, for safety reasons, there was no link whatsoever between the original and its backup. Once the central mission module was replaced it could itself supervise the reconstruction of the rest of the system in every detail, and all would be well.

Robots were instructed to bring the backup central mission module from the shielded strong room, where they guarded it, to the ship's logic chamber for installation.

This involved the lengthy exchange of emergency codes and protocols as the robots interrogated the agents as to the authenticity of the instructions. At last the robots were satisfied that all procedures were correct. They unpacked the backup central mission module from its storage housing, carried it out of the storage chamber, fell out of the ship and went spinning off into the void.

This provided the first major clue as to what it was that was wrong.

Further investigation quickly established what it was that had happened. A meteorite had knocked a large hole in the ship. The ship had not previously detected this because the meteorite had neatly knocked out that part of the ship's processing equipment which was supposed to detect if the ship had been hit by a meteorite.

The first thing to do was to try to seal up the hole. This turned out to be impossible, because the ship's sensors couldn't see that there was a hole, and the supervisors which should have said that the sensors weren't working properly weren't working properly and kept saying that the sensors were fine. The ship could only deduce the existence of the hole from the fact that the robots had clearly fallen out of it, taking its spare brain, which would have enabled it to see the hole, with them.

The ship tried to think intelligently about this, failed, and then blanked out completely for a bit. It didn't realise it had blanked out, of course, because it had blanked out. It was merely surprised to see the stars jump. After the third time the stars jumped the ship finally realised that it must be blanking out, and that it was time to take some serious decisions.

It relaxed.

Then it realised it hadn't actually taken the serious decisions yet and panicked. It blanked out again for a bit. When it awoke again it sealed all the bulkheads around where it knew the unseen hole must be.

It clearly hadn't got to its destination yet, it thought, fitfully, but since it no longer had the faintest idea where its destination was or how to reach it, there seemed to be little point in continuing. It consulted what tiny scraps of instructions it could reconstruct from the tatters of its central mission module.

"Your !!!!! !!!!! !!!!! year mission is to !!!!! !!!!! !!!!! !!!!!, !!!!! !!!!! !!!!! !!!!!, land !!!!! !!!!! !!!!! a safe distance !!!!! !!!!! ..... ..... ..... .... , land ..... ..... ..... monitor it. !!!!! !!!!! !!!!!..."

All of the rest was complete garbage.

Before it blanked out for good the ship would have to pass on those instructions, such as they were, to its more primitive subsidiary systems.

It must also revive all of its crew.

There was another problem. While the crew was in hibernation, the minds of all of its members, their memories, their identities and their understanding of what they had come to do, had all been transferred into the ship's central mission module for safe keeping. The crew would not have the faintest idea of who they were or what they were doing there. Oh well.

Just before it blanked out for the final time, the ship realised that its engines were beginning to give out too.

The ship and its revived and confused crew coasted on under the control of its subsidiary automatic systems, which simply looked to land wherever they could find to land and monitor whatever they could find to monitor.

Heiko Hatzfeld
+1. Very apt.
Paul Nathan

To my mind, the concept of failure recovery is, most of the time, a business problem, not a hardware or language problem.

Take an example: you have one UI tier and one subsystem. The subsystem is not very reliable, but the client on the UI tier should perceive it as if it were.

Now imagine that somehow your subsystem crashes. Do you really think the language you imagine can figure out for you how to handle a UI tier that depends on this subsystem?

Your user should be explicitly aware that the subsystem is not reliable. If you use messaging to provide high reliability, the client MUST know that (if he isn't aware, the UI can just freeze waiting for a response which may eventually come two weeks later). If he has to be aware of this, it means that any abstraction built to hide it will eventually leak.

By client, I mean end user. And the UI should reflect this unreliability and not hide it; a computer cannot think for you in that case.

Nicolas Dorier

"So a language which would remember it's state at any given moment, no matter if the power gets cut off, and continues where it left off."

"continues where it left off" is often not the correct recovery strategy. No language or environment in the world is going to attempt to guess how to recover from a particular fault automatically. The best it can do is provide you with tools to write your own recovery strategy in a way that doesn't interfere with your business logic, e.g.

  • Exception handling (to fail fast and still ensure consistency of state)
  • Transactions (to roll back incomplete changes)
  • Workflows (to define recovery routines that are called automatically)
  • Logging (for tracking down the cause of a fault)
  • AOP/dependency injection (to avoid having to manually insert code to do all the above)

These are very generic tools and are available in lots of languages and environments.
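Several of those generic tools can be combined in a few lines. The sketch below (all names invented) uses a decorator for AOP-style separation: the business function stays clean, while faults are logged, partial changes are rolled back, and the exception is re-raised to fail fast.

```python
import functools
import io
import logging

log_stream = io.StringIO()
logger = logging.getLogger("recovery_demo")  # invented logger name
logger.addHandler(logging.StreamHandler(log_stream))
logger.setLevel(logging.ERROR)
logger.propagate = False

def fail_fast(rollback):
    """Decorator: on any error, log the cause, undo partial changes via the
    supplied rollback, and re-raise -- the fault stays visible while the
    business logic stays free of recovery clutter."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                logger.exception("fault in %s; rolling back", fn.__name__)
                rollback()
                raise
        return inner
    return wrap
```

The recovery strategy (here, the `rollback` callable) is still written by the programmer; the tooling only keeps it out of the way of the business logic.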

Christian Hayter
+2  A: 

If the failure mode is limited to hardware failure, VMware Fault Tolerance claims something similar to what you want. It runs a pair of virtual machines across multiple hosts, and using what they call vLockstep, the primary VM sends all state to the secondary VM in real time, so in case of primary failure, execution transparently flips to the secondary.

My guess is that this wouldn't help with communication failure, which is more common than hardware failure. For serious high availability, you should consider distributed-systems techniques like Birman's process-group approach (paper in PDF format, or the book Reliable Distributed Systems: Technologies, Web Services, and Applications).

Paul Nathan
+5  A: 

There is an experimental language called Napier88 that (in theory) has some attributes of being disaster-proof. The language supports orthogonal persistence, and in some implementations this extended to include the state of the entire computation. Specifically, when the Napier88 runtime system check-pointed a running application to the persistent store, the current thread state would be included in the checkpoint. If the application then crashed and you restarted it in the right way, you could resume the computation from the checkpoint.
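A toy analogue of that checkpointed-computation idea can be written in Python (the class and helper names are invented, and real orthogonal persistence is transparent, which this is not): the computation's whole state, including its position in the loop, is snapshotted after each step, so a crashed run can resume from the last checkpoint.

```python
import pickle

class Summation:
    """A computation whose whole state (including its position in the loop)
    is snapshotted at each checkpoint, so a crashed run can resume exactly
    where it stopped -- a toy analogue of checkpointed thread state."""
    def __init__(self, n):
        self.n, self.i, self.total = n, 0, 0

    def run(self, checkpoint, crash_at=None):
        while self.i < self.n:
            if crash_at is not None and self.i == crash_at:
                raise RuntimeError("simulated crash")
            self.i += 1
            self.total += self.i
            checkpoint(pickle.dumps(self))  # snapshot after every step
        return self.total
```

The difference from Napier88 is exactly the one the answer draws: here the programmer must call `checkpoint` explicitly, whereas orthogonal persistence did it as part of the runtime.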

Unfortunately, there are a number of hard issues that need to be addressed before this kind of technology is ready for mainstream use. These include figuring out how to support multi-threading in the context of orthogonal persistence, figuring out how to allow multiple processes to share a persistent store, and scalable garbage collection of persistent stores.

And there is the problem of doing orthogonal persistence in a mainstream language. There have been attempts to do OP in Java, including one by people associated with Sun, but there is nothing active at the moment. The JDO / Hibernate approaches are more favoured these days.

Stephen C
So what part of the *language* supported checkpointing? You can arguably do this with any language by having such a runtime system. People already do this with supercomputing applications that run for weeks, because the MTBF of the machine is shorter and they don't want to rerun the computation from the beginning. But it isn't the language that is helping here: it's storing the VM image. That's easy but not very interesting.
Ira Baxter
@Ira: The language specifies that any object that is reachable from a persistent root is made persistent. Checkpoints (persistent store commits) happen automatically, or they can be requested by the application. We are NOT talking about check-pointing virtual memory. Rather, it is a process (like GC) where the object tree is traversed and (only) dirty proot-reachable objects are written to the store. Restartability is achieved by ensuring that thread/stack state is proot-reachable.
Stephen C