How often do you solve problems by restarting a computer, router, program, or browser? Or even by reinstalling the operating system or a software component?

This seems to be a common pattern: when a software component is suspected of not keeping its state correctly, you just get back to the initial state by restarting the component.

I've heard that Amazon/Google run clusters of many, many nodes, and one important property of each node is that it can restart in seconds. So if one of them fails, returning it to its initial state is just a matter of restarting it.

Are there any languages/frameworks/design patterns out there that treat this technique as a first-class citizen?

EDIT: A link that describes some of the principles behind Amazon, as well as overall principles of availability and consistency: http://www.infoq.com/presentations/availability-consistency

+1  A: 

Embedded systems may have a checkpoint feature where, every n ms, the current stack is saved. The memory is non-volatile across power restarts (i.e. battery-backed), so on power-up a test is made to see whether the code needs to jump back to an old checkpoint or whether it's a fresh system.

I'm going to guess that a similar technique(but more sophisticated) is used for Amazon/Google.
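
A minimal sketch of that checkpoint idea in C. The battery-backed address and magic value here are hypothetical placeholders, not from any particular chip:

    #include <stdint.h>
    #include <string.h>

    #define CKPT_MAGIC 0xC0FFEE42u   /* hypothetical "checkpoint is valid" marker */

    struct checkpoint {
        uint32_t magic;
        uint32_t step;               /* where the main loop left off */
        uint8_t  state[64];          /* application state worth preserving */
    };

    /* Hypothetical battery-backed RAM address; on a real part the vendor's
     * headers would define the backup-SRAM region. */
    static struct checkpoint * const ckpt = (struct checkpoint *)0x40024000u;

    static void save_checkpoint(uint32_t step, const uint8_t *state)
    {
        ckpt->step = step;
        memcpy(ckpt->state, state, sizeof ckpt->state);
        ckpt->magic = CKPT_MAGIC;    /* written last, so a torn save stays invalid */
    }

    int main(void)
    {
        uint32_t step = 0;
        uint8_t  state[64] = {0};

        if (ckpt->magic == CKPT_MAGIC) {          /* warm start: resume */
            step = ckpt->step;
            memcpy(state, ckpt->state, sizeof state);
        }                                         /* else: fresh system */

        for (;;) {
            /* ... do work, advance step/state; a timer calls this every n ms ... */
            save_checkpoint(step, state);
        }
    }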

Paul Nathan
ennuikiller
Depends on how they are programmed. I wrote a system that kept on chugging - as long as the battery was good, the system didn't notice failures. This is generally a much different approach than an application/process-driven system like the iPhone/BlackBerry, though. I suspect the iPhone/BlackBerry don't do a checkpoint feature... it's a full restart. Yes? No?
Paul Nathan
+3  A: 

This is actually very rare in the Unix/Linux world. Those OSes were designed (and so was Windows) to protect themselves from badly behaved processes. I am sure Google is not relying on hard restarts to correct misbehaving software. I would say this technique should not be employed, and if someone says it's the fastest route to recovery for their software, you should look for something else!

ennuikiller
That's a good point, but maybe I put the emphasis in the wrong place in my question. I was actually interested in knowing whether this technique is used at other levels (not only the OS) and whether it is recognised as a design pattern.
Superfilin
+1. To me, rebooting is not really a pattern; it's more a last-resort attempt used by people who have no clue about what is really happening or what the real problem is, and so no clue how to solve it. And actually, rebooting is not so common IMO (I've seen many unices rebooted only because of an OS patch, and that is not that frequent).
Pascal Thivent
If we are talking about a system that runs on a single server (or a few), then rebooting it, of course, is not a pattern, because you would have to stop all operations. I've heard of UNIX machines that were not rebooted for years. But if you have a cluster of 100 nodes and some of them start behaving badly, and a node restart takes a long time, you risk killing the whole system under high load. So my actual question was whether this pattern is recognised and used in other areas of software development, not only in servers/clustering.
Superfilin
+2  A: 

Microcontrollers typically have a watchdog timer, which must be reset (by a line of code) every so often or else the microcontroller will reset. This keeps the firmware from getting stuck in an endless loop, stuck waiting for input, etc.

Unused memory is sometimes set to an instruction which causes a reset, or a jump to the same location the microcontroller starts at when it is reset. This will reset the microcontroller if it somehow jumps to a location outside the program memory.
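
A minimal sketch of the watchdog idea, here using avr-libc's watchdog API as one concrete example (timeouts and register details vary by microcontroller, and do_one_unit_of_work is a hypothetical application step):

    /* Watchdog sketch for an AVR microcontroller, using avr-libc.
     * If the main loop ever stops calling wdt_reset() -- stuck in an
     * endless loop, stuck waiting for input -- the hardware resets the chip.
     */
    #include <avr/wdt.h>

    static void do_one_unit_of_work(void)
    {
        /* hypothetical application step: read sensors, update outputs, ... */
    }

    int main(void)
    {
        wdt_enable(WDTO_1S);        /* reset the MCU if not "kicked" within ~1 s */

        for (;;) {
            do_one_unit_of_work();
            wdt_reset();            /* "kick" the watchdog: still making progress */
        }
    }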

Jeanne Pindar
+1  A: 

Though I can't think of a design pattern per se, in my experience it's a result of the "select is broken" mindset among developers.

I've seen a 50-user site cripple both SQL Server Enterprise Edition (with a 750 MB database) and a Novell server because of poor connection management coupled with excessive calls and no caching. According to the developers, Novell was always the culprit, until we found a missing "CloseConnection" call in a core library. By then, thousands had been spent, unsuccessfully, on upgrades to address that one missing line of code.

(Why they had Enterprise Edition was beyond me so don't ask!!)

Austin Salonen
+1  A: 

If you look at scripting languages like PHP running on Apache, each invocation starts a new process. In the basic case there is no shared state between processes, and once the invocation has finished, the process is terminated.

The advantages are less onus on resource management, since resources are released when the process finishes, and less need for error handling, since the process is designed to fail fast and cannot be left in an inconsistent state.
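
A minimal sketch of that share-nothing, process-per-request model, written as a toy C server rather than real Apache/PHP plumbing (the port and response are arbitrary):

    /* Process-per-request sketch: each connection is handled in a fresh
     * child process that exits when done, so no state leaks between
     * requests and the OS reclaims every resource on exit.
     */
    #include <netinet/in.h>
    #include <signal.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int srv = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);
        bind(srv, (struct sockaddr *)&addr, sizeof addr);
        listen(srv, 16);
        signal(SIGCHLD, SIG_IGN);            /* kernel reaps exited children */

        for (;;) {
            int conn = accept(srv, 0, 0);
            if (conn < 0)
                continue;
            if (fork() == 0) {               /* child: brand-new state each time */
                close(srv);
                const char msg[] = "HTTP/1.0 200 OK\r\n\r\nhello\r\n";
                write(conn, msg, sizeof msg - 1);
                _exit(0);                    /* done: the OS frees everything */
            }
            close(conn);                     /* parent just keeps accepting */
        }
    }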

parkr
+1 good point, parkr :)!
Superfilin
+1  A: 

I've seen it a few places at the application level (an app restarting itself if it bombs).

I've implemented the pattern at an application level, where a service reading from dBase files starts getting errors after a certain number of reads. It watches for a particular error that gets thrown, and if it sees that error, the service calls a console app that kills the process and restarts the service. It's kludgey, and I hate it, but for this particular situation I could find no better answer.

And bear in mind that IIS has a built-in feature that restarts the application pool under certain conditions.

For that matter, restarting a service is an option for any service on Windows as one of the actions to take when the service fails.
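
The restart-on-failure idea itself fits in a few lines. A minimal supervisor sketch in C, assuming a hypothetical ./worker binary (Windows service recovery and IIS application-pool recycling do essentially this for you):

    /* Supervisor sketch: run a worker program and restart it whenever
     * it exits abnormally. */
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        for (;;) {
            pid_t pid = fork();
            if (pid == 0) {
                execl("./worker", "worker", (char *)0);  /* hypothetical worker */
                _exit(127);                              /* exec failed */
            }
            int status;
            waitpid(pid, &status, 0);
            if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                return 0;                 /* clean shutdown: stop supervising */
            fprintf(stderr, "worker died, restarting in 1s\n");
            sleep(1);                     /* crude back-off before the restart */
        }
    }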

David Stratton
+2  A: 

This is common in the embedded systems world, and in telecommunications. It's much less common in the server based world.

There's a research group you might be interested in. They've been working on Recovery-Oriented Computing or "ROC". The key principle in ROC is that the cleanest, best, most reliable state that any program can be in is right after starting up. Therefore, on detecting a fault, they prefer to restart the software rather than attempt to recover from the fault.

Sounds simple enough, right? Well, most of the research has gone into implementing that idea. The reason is exactly what you and other commenters have pointed out: OS restarts are too slow to be a viable recovery method.

ROC relies on three major parts:

  1. A method to detect faults as early as possible.
  2. A means of isolating the faulty component while preserving the rest of the system.
  3. Component-level restarts.

The real key difference between ROC and the typical "nightly restart" approach is that in ROC the restarts are a reaction to detected faults, not a scheduled precaution. What I mean is that most software is written with some degree of error handling and recovery (throw-and-catch, logging, retry loops, etc.). A ROC program would detect the fault (exception) and immediately exit. Mixing the two paradigms just leaves you with the worst of both worlds: low reliability and errors.
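
A crash-only sketch of the component side in C: detect the fault early, then exit immediately rather than catch-and-patch, leaving the component-level restart to a supervisor like the one sketched above (the queue and its invariants are invented for illustration):

    /* Crash-only component sketch (ROC style): validate invariants as
     * early as possible, and on any violation exit at once instead of
     * attempting in-place recovery. */
    #include <stdio.h>
    #include <stdlib.h>

    struct queue { int head, tail, cap; };

    static void check_invariants(const struct queue *q)
    {
        /* 1. detect the fault as early as possible */
        if (q->head < 0 || q->tail < 0 || q->head > q->cap || q->tail > q->cap) {
            fprintf(stderr, "queue invariant violated: micro-rebooting\n");
            /* 2.+3. no catch-and-continue: exit so the supervisor can
             * restart just this component with a clean state */
            exit(EXIT_FAILURE);
        }
    }

    int main(void)
    {
        struct queue q = { 0, 0, 128 };
        for (int i = 0; i < 1000; i++) {
            /* ... process one message, updating q ... */
            check_invariants(&q);   /* any corruption => immediate restart */
        }
        return 0;
    }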

mtnygard
@mtnygard: Thanks for the link :). I didn't know about the ROC concept.
Superfilin
