I have a multistep process in which each step performs some network I/O (a web service call) and then persists some data. I want to design it to be fault-tolerant, so that if the service fails, whether because of a system crash or because one of the steps fails, I can recover and restart from the last error-free step.

Here is how I am thinking of addressing this (this is pretty high level):

  1. Store the state of each step (NOT_STARTED, IN_PROGRESS, FAILED) in a database table
  2. If a step fails, mark it and its dependent steps as "FAILED" and move on to the next non-dependent step
  3. Recover by reading this table (e.g., in a bootstrap portion of the application)
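To make the three points above concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The table name `steps`, the helper names, and the extra `COMPLETED` state are assumptions added for illustration (the list above names only three states, but resuming needs some way to tell finished steps apart). It also assumes the steps are listed in dependency order.

```python
import sqlite3

# Three states from the list above, plus a hypothetical COMPLETED state
# so that a restart can skip steps that already finished.
NOT_STARTED, IN_PROGRESS, FAILED, COMPLETED = (
    "NOT_STARTED", "IN_PROGRESS", "FAILED", "COMPLETED")

def open_store(path=":memory:"):
    """Open (or create) the hypothetical step-state table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS steps "
        "(name TEXT PRIMARY KEY, state TEXT NOT NULL)")
    conn.commit()
    return conn

def set_state(conn, name, state):
    with conn:  # commit immediately so the state survives a crash
        conn.execute(
            "INSERT OR REPLACE INTO steps (name, state) VALUES (?, ?)",
            (name, state))

def get_state(conn, name):
    row = conn.execute(
        "SELECT state FROM steps WHERE name = ?", (name,)).fetchone()
    return row[0] if row else NOT_STARTED

def run_pipeline(conn, steps, depends_on):
    """steps: ordered list of (name, callable).
    depends_on: maps a step name to the names it depends on."""
    for name, action in steps:
        if get_state(conn, name) == COMPLETED:
            continue  # finished in an earlier run, nothing to redo
        deps = depends_on.get(name, [])
        if any(get_state(conn, dep) == FAILED for dep in deps):
            set_state(conn, name, FAILED)  # point 2: fail dependents
            continue                       # and move to the next step
        set_state(conn, name, IN_PROGRESS)
        try:
            action()
        except Exception:
            set_state(conn, name, FAILED)
        else:
            set_state(conn, name, COMPLETED)
```

Calling `run_pipeline` again at startup covers point 3: completed steps are skipped, while steps left as IN_PROGRESS or FAILED are retried. Note that a retried step may repeat its web service call, so this sketch only works safely if the steps are idempotent.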

I was wondering if there are some design patterns, frameworks and algorithms that address this problem.

A: 

You may want to consider the Chain of Responsibility design pattern: http://en.wikipedia.org/wiki/Chain-of-responsibility_pattern

Andriy Buday
A: 

Memento (GoF) could be used to store the state before a potentially failing call.

One good keyword to search for would be transactions. That allows you to roll-back changes that occurred on account of a failure to the nearest "stable" state. This is one thing your DB would be providing.
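As a rough illustration of the roll-back behaviour, here is a small sketch using Python's standard-library sqlite3 module. The tables `results` and `done` and the function names are made up for the example. The idea is that a step's data and its "done" marker are written in one transaction, so a failure partway through leaves neither behind.

```python
import sqlite3

def create_schema(conn):
    # Hypothetical tables: one for a step's data, one marking finished steps.
    conn.execute(
        "CREATE TABLE results (step TEXT NOT NULL, value TEXT NOT NULL)")
    conn.execute("CREATE TABLE done (step TEXT PRIMARY KEY)")
    conn.commit()

def persist_step(conn, step_name, values):
    """Write all values for a step and mark it done, or write nothing."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if it raises.
        with conn:
            conn.executemany(
                "INSERT INTO results (step, value) VALUES (?, ?)",
                [(step_name, v) for v in values])
            conn.execute("INSERT INTO done (step) VALUES (?)", (step_name,))
        return True
    except sqlite3.Error:
        return False  # the database is back at its last stable state
```

If any insert fails (for example, a NULL value violating the NOT NULL constraint), the rows already written for that step are rolled back along with it.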

The Command pattern also has known uses of transactions.

Paul Rubel
A: 

In terms of behavioral design patterns, I would recommend looking into the following, as they seem well suited to your needs. Keep in mind that this list is based on a very high-level understanding of your implementation.

  • Template method - For defining program skeletons
  • Strategy - For swapping algorithms as needed
  • Memento - For restoring objects to their previous states
  • State - Coupled with the Memento pattern

If you're not already familiar, I would STRONGLY recommend looking up the Model-View-Controller and Model-View-Presenter patterns, as they will make your development experience much more enjoyable.

If you have any follow-up questions, feel free to ask. :)

Ari Patrick
A: 

This is a nice paper, "Design patterns for checkpoint based recovery", that addresses the problem.
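For a feel of what checkpoint-based recovery can look like in practice, here is a generic sketch in Python. It is not taken from the paper; the file layout, the `next_step`/`results` keys, and the function names are assumptions for illustration. Progress is written to a file after every step, and a restart resumes from the last saved checkpoint.

```python
import json
import os
import tempfile

def save_checkpoint(path, data):
    """Write the checkpoint to a temp file, then swap it into place so a
    crash mid-write does not leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(data, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)

def load_checkpoint(path):
    """Return the saved progress, or a fresh state if none exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_step": 0, "results": []}

def run_with_checkpoints(path, steps):
    """steps: ordered list of callables returning JSON-serializable values."""
    state = load_checkpoint(path)
    for i in range(state["next_step"], len(steps)):
        result = steps[i]()  # may raise; earlier progress is already saved
        state["results"].append(result)
        state["next_step"] = i + 1
        save_checkpoint(path, state)
    return state["results"]
```

If a step raises or the process dies, calling `run_with_checkpoints` again with the same path picks up at the first step that has not been checkpointed.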

numes