Testing fault tolerant code

+4 Q:

I’m currently working on a server application where we have agreed to try to maintain a certain level of service. The guarantee we want to make is this: if a request is accepted by the server and the server sends an acknowledgement to the client, the request will be carried out, even if the server crashes. Because requests can be long running and the acknowledgement time needs to be short, we implement this by persisting the request, then sending an acknowledgement to the client, then carrying out the various actions to fulfil the request. As actions are carried out they too are persisted, so the server knows the state of a request on start-up, and there are also various reconciliation mechanisms with external systems to check the accuracy of our logs.
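
The persist, acknowledge, execute sequence described above might look roughly like the following. This is only a sketch in Python (the real server is on .NET); the Journal class, the event fields and the function names are all assumptions made for illustration, not the actual implementation.

```python
import json
import os
import tempfile


class Journal:
    """Hypothetical append-only log of request events, synced on every write."""

    def __init__(self, path):
        self.path = path

    def append(self, event):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the record to disk

    def replay(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def accept(journal, request_id, actions):
    """Persist the request first; only then is it safe to acknowledge."""
    journal.append({"type": "accepted", "id": request_id, "actions": actions})
    return "ACK"


def run_pending(journal, perform):
    """Run every accepted action that has no 'done' record yet.

    Called after accepting a request and again on start-up after a crash.
    """
    accepted, done = {}, set()
    for event in journal.replay():
        if event["type"] == "accepted":
            accepted[event["id"]] = event["actions"]
        elif event["type"] == "done":
            done.add((event["id"], event["action"]))
    for request_id, actions in accepted.items():
        for action in actions:
            if (request_id, action) not in done:
                perform(request_id, action)
                journal.append(
                    {"type": "done", "id": request_id, "action": action}
                )


if __name__ == "__main__":
    journal = Journal(os.path.join(tempfile.mkdtemp(), "journal.log"))
    print(accept(journal, "r1", ["reserve", "charge"]))
    run_pending(journal, lambda rid, action: print("performing", rid, action))
```

In this sketch a crash between performing an action and recording it as done makes the action run again on restart, so the actions would need to be idempotent or covered by the reconciliation step.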

This all seems to work fairly well, but we have difficulty saying so with any conviction, because we find it very hard to test our fault-tolerant code. So far we’ve come up with two strategies, but neither is entirely satisfactory:

1. Have an external process watch the server and try to kill it at what the external process thinks is an appropriate point in the test.
2. Add code to the application that causes it to crash at certain known critical points.

My problem with the first strategy is that the external process cannot know the exact state of the application, so we cannot be sure we’re hitting the most problematic points in the code. My problem with the second strategy, although it gives more control over where the fault takes place, is that I do not like having fault-injection code within my application, even with conditional compilation and the like. I fear it would be too easy to overlook a fault injection point and have it slip into a production environment.

+3 A:

I think there are three ways to deal with this. First, if possible, I would suggest a comprehensive set of integration tests for these pieces of code, using dependency injection or factory objects to produce broken actions during the tests.

Secondly, running the application under random kill -9s and disabling network interfaces may be a good way to test these things.
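
A random-kill harness along these lines could be sketched as below. It is written in Python for brevity, and the WORKER script, the check_recovery callback and the delay bounds are hypothetical stand-ins for your server and your own recovery checks.

```python
import os
import random
import subprocess
import sys
import tempfile
import time

# Hypothetical stand-in for the server: appends a counter to a file forever.
WORKER = """
import sys, time
n = 0
while True:
    with open(sys.argv[1], "a") as f:
        print(n, file=f)
    n += 1
    time.sleep(0.01)
"""


def kill_after_random_delay(command, rng, max_delay=0.5):
    """Start `command`, kill it abruptly after a random delay."""
    proc = subprocess.Popen(command)
    delay = rng.uniform(0.05, max_delay)
    time.sleep(delay)
    proc.kill()  # SIGKILL on POSIX; TerminateProcess on Windows
    return delay, proc.wait()


def run_trials(command, check_recovery, trials=20, seed=0):
    """Repeat the kill, calling `check_recovery` after every crash."""
    rng = random.Random(seed)  # fixed seed so the sequence of delays repeats
    for trial in range(trials):
        delay, code = kill_after_random_delay(command, rng)
        check_recovery(trial, delay, code)


if __name__ == "__main__":
    out = os.path.join(tempfile.mkdtemp(), "out.log")
    command = [sys.executable, "-c", WORKER, out]
    run_trials(
        command,
        lambda t, d, c: print(f"trial {t}: killed after {d:.2f}s, exit {c}"),
        trials=3,
    )
```

Seeding the generator makes the sequence of delays repeatable, but the exact instruction at which the process dies still varies from run to run.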

I would also suggest testing file system failure. How you would do that depends on your OS; on Solaris or FreeBSD I would create a ZFS file system in a file, and then rm the file while the application is running.

If you are using database code, then I would suggest testing failure of the database as well.

Another alternative to dependency injection, and probably the solution I would use, is interceptors. You can enable crash-test interceptors in your code; these would know the state of the application and introduce the failures listed above at the correct time, or any others you may want to create. It would not require changes to your existing code, just some additional code to wrap it.
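
To give a rough idea of what such an interceptor does, here is a minimal sketch in Python rather than .NET. CrashInterceptor, SimulatedCrash and Store are hypothetical names; a real AOP framework would generate the wrapper for you instead of relying on a hand-written proxy.

```python
import functools


class SimulatedCrash(BaseException):
    """Stands in for a hard crash; `except Exception` will not swallow it."""


class CrashInterceptor:
    """Proxy that forwards calls to `target` and crashes at one chosen point."""

    def __init__(self, target, method, when="before", on_call=1):
        self._target = target
        self._method = method
        self._when = when        # "before" or "after" the real call
        self._on_call = on_call  # crash on the Nth call of `method`
        self._calls = 0

    def __getattr__(self, name):
        # Only reached for attributes the proxy itself does not define.
        attr = getattr(self._target, name)
        if name != self._method or not callable(attr):
            return attr

        @functools.wraps(attr)
        def wrapper(*args, **kwargs):
            self._calls += 1
            hit = self._calls == self._on_call
            if hit and self._when == "before":
                raise SimulatedCrash(f"before {name}")
            result = attr(*args, **kwargs)
            if hit and self._when == "after":
                raise SimulatedCrash(f"after {name}")
            return result

        return wrapper


class Store:
    """Hypothetical production component, left completely unmodified."""

    def __init__(self):
        self.saved = []

    def save(self, item):
        self.saved.append(item)


if __name__ == "__main__":
    store = Store()
    proxy = CrashInterceptor(store, "save", when="after", on_call=2)
    proxy.save("a")
    try:
        proxy.save("b")
    except SimulatedCrash as crash:
        print("crashed:", crash, "| saved so far:", store.saved)
```

In a real crash test the wrapper would terminate the process instead of raising, so that no cleanup code runs.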

Justin 2010-05-03 09:26:15

Where can I find more information about "crash test interceptors" on the .NET platform? Although we use DI for unit tests, I don't think it will work too well for integration tests. There are two reasons for this: firstly we want integration tests to be as close as possible to the code that will run in production and secondly requiring injected code to cause the failure would have a significant (and undesirable) impact on how we designed the various modules of the application.

Robert 2010-05-03 09:33:16

Hi Robert, there is some good reading here http://www.sharpcrafters.com/aop.net/runtime-weaving

Justin 2010-05-03 09:44:29

Also a good example using Spring.NET http://www.developer.com/net/csharp/article.php/3795031/Aspect-Oriented-Programming-AOP-with-SpringNet.htm

Justin 2010-05-03 09:49:11

Thanks Justin, an interesting resource.

Robert 2010-05-03 09:51:58

Some additional options that may be easier than Spring (unless you're already using Spring): http://ninject.org is a great library, and http://github.com/idavis/ninject.extensions.interception provides interceptors for it. NanoContainer may also be an option at http://nanocontainer.codehaus.org/NanoContainer.NET but I am unsure of the status of AOP support in the .NET version.

Justin 2010-05-03 11:18:46

+2 A:

One possible answer to the first point is to run many experiments with your external process, which increases the probability of hitting the problematic parts of the code. You can then analyze the core dump file to determine where the code actually crashed.

Another way is to increase observability and/or controllability by stubbing library or kernel calls, i.e., without modifying your application code.
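
As an in-process illustration of stubbing a library call, the sketch below uses Python's unittest.mock to make os.fsync fail on its second call while the application function stays untouched. The save_record and failing_on_call functions are hypothetical; stubbing at the shared-library or kernel level works on the same principle but from outside the process.

```python
import errno
import os
import tempfile
from unittest import mock


def save_record(path, line):
    """Hypothetical application code: append a line and force it to disk."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())


def failing_on_call(real, n, error):
    """Return a stub that behaves like `real`, except that call `n` raises."""
    count = 0

    def stub(*args, **kwargs):
        nonlocal count
        count += 1
        if count == n:
            raise error
        return real(*args, **kwargs)

    return stub


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "records.log")
    # Capture the real os.fsync before it is patched.
    stub = failing_on_call(os.fsync, 2, OSError(errno.EIO, "injected I/O error"))
    with mock.patch("os.fsync", side_effect=stub):
        save_record(path, "first")
        try:
            save_record(path, "second")
        except OSError as exc:
            print("injected failure:", exc)
```

Because the stub counts calls, the failure lands at a chosen point rather than at a random one.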

You can find some resources on the Fault Injection page of Wikipedia, in particular in the Software Implemented Fault Injection section.

mouviciel 2010-05-03 09:36:21

Thanks, the fault injection page is a useful resource.

Robert 2010-05-03 09:46:50

+1 A:

I was just about to write the same as Justin :)

The component I would suggest replacing during testing is the logging component (if you have one; if not, I'd strongly suggest implementing one...). It's relatively easy to replace it with code that generates errors, and the logger usually gets enough information to know the current application state.

It also seems feasible to make sure that the testing code doesn't go into production. I would discourage conditional compilation, though, and rather use a configuration file to select the logging component.
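
A minimal sketch of that idea, in Python rather than .NET: a logging handler that "crashes" when a chosen message is logged, selected purely by configuration. FaultInjectingHandler, build_logger, handle_request and the crash_on key are hypothetical names invented for this example.

```python
import logging


class SimulatedCrash(BaseException):
    """Stands in for a hard crash; `except Exception` will not swallow it."""


class FaultInjectingHandler(logging.Handler):
    """Logging handler that raises when a chosen message is logged."""

    def __init__(self, crash_on, occurrence=1):
        super().__init__()
        self.crash_on = crash_on      # substring marking the critical point
        self.occurrence = occurrence  # crash on the Nth matching record
        self.seen = 0

    def emit(self, record):
        message = record.getMessage()
        if self.crash_on in message:
            self.seen += 1
            if self.seen == self.occurrence:
                # A real test would more likely kill the process here.
                raise SimulatedCrash(message)


def build_logger(config):
    """Pick the logging component from configuration, not compile flags."""
    logger = logging.getLogger(config.get("name", "server"))
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    if "crash_on" in config:  # only test configurations set this key
        logger.addHandler(FaultInjectingHandler(config["crash_on"]))
    else:
        logger.addHandler(logging.NullHandler())
    return logger


def handle_request(logger, steps):
    """Hypothetical request handler that logs each state transition."""
    completed = []
    for step in steps:
        logger.info("starting %s", step)
        completed.append(step)
        logger.info("finished %s", step)
    return completed


if __name__ == "__main__":
    test_logger = build_logger({"name": "test", "crash_on": "finished charge"})
    try:
        handle_request(test_logger, ["reserve", "charge", "ship"])
    except SimulatedCrash as crash:
        print("crashed at:", crash)
```

The application code only ever calls the logger, so the production configuration contains no fault injection point that could be left enabled by accident.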

Using "random" kills might help to detect errors but is not well suited for systematic testing because of its non-determinism. Therefore I wouldn't use it for automatic tests.

MartinStettner 2010-05-03 09:37:25

Having the logger be responsible for injecting the faults is an interesting idea, and something that just hadn't occurred to me. Will definitely give it a go.

Robert 2010-05-03 09:48:19

+1 A:

Your concern about fault injection is not a fundamental concern. You merely need a foolproof way to prevent such code from ending up in deployment. One way to do so is by designing your fault injector as a debugger, i.e. the faults are injected by a process external to your process. This already provides a level of isolation. Furthermore, most OSes provide some kind of access control which prevents debugging unless specifically enabled. In the most primitive form, this means limiting it to root; on other operating systems it requires a specific "debug privilege". Naturally, nobody will have that in production, and thus your fault injector cannot even run there.

Practically, the fault injector can set breakpoints at specific addresses, i.e. at a function or even a line of code. You can then react to that, e.g. by terminating the process after a certain breakpoint is hit three times.

MSalters 2010-05-03 09:52:57

You're right that fault injection leaking into production may or may not be a problem depending on how fault injection is implemented. Implementing fault injection via a debugger is certainly an interesting approach. There seems to be at least one extension to WinDbg that would allow you to do this (http://www.woodmann.com/collaborative/tools/index.php/PyDbgEng). Are there any other approaches that work well in a Windows/.NET environment?

Robert 2010-05-03 10:14:41
