I'm writing a program that will continuously process files placed into a hot folder.
This program should run unattended with 100% uptime and no admin intervention. In other words, it should not fail on "stupid" errors: if someone deletes the output directory, it should simply recreate it and move on.
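To make the intent concrete, here is a minimal Python sketch of that behavior, assuming hypothetical `incoming` and `processed` folder names; the point is that the per-file step repairs its own environment (recreating the output directory if needed) and the loop survives any one file failing:

```python
import os
import shutil
import time

# Hypothetical paths; substitute your real hot folder and output dir.
HOT_FOLDER = "incoming"
OUTPUT_DIR = "processed"

def process_file(path: str) -> None:
    """Placeholder for the real per-file work."""
    # Recreate the output directory if someone deleted it out from
    # under us, then move the file there and carry on.
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    shutil.move(path, os.path.join(OUTPUT_DIR, os.path.basename(path)))

def watch_loop() -> None:
    os.makedirs(HOT_FOLDER, exist_ok=True)
    while True:
        for name in os.listdir(HOT_FOLDER):
            try:
                process_file(os.path.join(HOT_FOLDER, name))
            except OSError as exc:
                # Log and keep going: one bad file must not
                # bring down the whole service.
                print(f"recoverable error on {name}: {exc}")
        time.sleep(1)
```

This polling loop is only illustrative; a real service would likely use filesystem notifications and proper logging.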
What I'm thinking of doing is writing the entire program first, then going back through it looking for "error points" and adding handling code at each one.
What I'm trying to avoid is adding extraneous or unnecessary error handling, or building error handling into the control flow of the program (i.e. the error handling drives the flow of the program). Perhaps it could control the flow to a certain extent, but beyond that it would constitute bad design (subjectively speaking).
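To illustrate the distinction I mean, here is a hedged sketch (hypothetical `OUTPUT_DIR` and `write_result` names): the error handler only repairs the environment and retries, so the happy path stays linear and the exception never steers the program's normal logic:

```python
import os

OUTPUT_DIR = "processed"  # hypothetical output directory

def write_result(name: str, data: bytes) -> str:
    """Write data to the output dir, healing a deleted dir in place."""
    target = os.path.join(OUTPUT_DIR, name)
    try:
        with open(target, "wb") as f:
            f.write(data)
    except FileNotFoundError:
        # Recovery, not control flow: recreate the directory,
        # retry once, and return to the normal path.
        os.makedirs(OUTPUT_DIR, exist_ok=True)
        with open(target, "wb") as f:
            f.write(data)
    return target
```

The contrast would be code where the `except` branch carries the program into a different processing path entirely, which is the flow-control-by-exception style I want to avoid.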
What are some methodologies for "error proofing" a "critical" process?