views: 72
answers: 4

Considering that you never need to use a disaster recovery plan until disaster strikes, I was wondering: how can an IT department test its disaster recovery plans? How do you simulate the failure of key systems? Is there a way to ensure that your testing is as real-world as possible? Thanks for your suggestions.

+1  A: 

Have a disaster!

If you are confident in your disaster plan, you should be able to walk into your data center (or janitor's closet) and pull the power plug out.

Tandem used to have a very good demonstration of its cluster failover stuff - they fired a shotgun through one of the servers while it was running an app.

Martin Beckett
That sounds like a great demonstration/test!
FrustratedWithFormsDesigner
Another one was dropping a safe onto one of a pair of servers.
Martin Beckett
A: 

Testing a plan is best done as close to real as possible. As soon as you start simulating this or that, the test becomes useless. If your management allows it, power off machines, unplug your LAN cables, deploy a virus. Have some fun with it!
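If you want to script that kind of random failure injection rather than walking the floor, a minimal Python sketch might look like the following. It assumes a systemd-based Linux host and root privileges, and the service names are placeholders for whatever actually runs in your environment:

    #!/usr/bin/env python3
    """Chaos-style failure injection: stop one service at random so the
    team can watch how the rest of the stack reacts."""

    import random
    import subprocess

    # Hypothetical services; substitute whatever actually runs on the box.
    CANDIDATE_SERVICES = ["nginx", "postgresql", "redis-server"]

    def inject_failure() -> str:
        """Pick one service at random and stop it via systemctl."""
        victim = random.choice(CANDIDATE_SERVICES)
        subprocess.run(["systemctl", "stop", victim], check=True)
        return victim

    if __name__ == "__main__":
        stopped = inject_failure()
        print(f"Stopped {stopped} - now watch how the rest of the stack reacts.")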

Jay
+1  A: 

Most of the time, a disaster is simply a server going down.

Depending on how critical your apps are, just turn off a machine in your test environment and see how the other apps, servers, and notification services react.

If they don't react in an acceptable way, you have changes to make.
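One cheap way to watch the reaction during such a test is to poll an application endpoint for the duration of the outage. Here is a minimal Python sketch; the URL and polling interval are placeholders, and the point is that your real notification service should fire at the same moments this script prints DOWN:

    #!/usr/bin/env python3
    """Poll an app health endpoint during the test outage and log
    exactly when it stops answering."""

    import time
    import urllib.request

    HEALTH_URL = "http://app.example.internal/health"  # hypothetical endpoint
    POLL_SECONDS = 10

    while True:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                print(time.strftime("%H:%M:%S"), "OK", resp.status)
        except OSError as exc:  # covers URLError and timeouts
            # When this prints DOWN, your notification service should be
            # firing too; if it isn't, you have changes to make.
            print(time.strftime("%H:%M:%S"), "DOWN:", exc)
        time.sleep(POLL_SECONDS)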

Ed B
+3  A: 

To add to the above comments: yes, have a "disaster". How exactly you do it depends on your BCP. For example, if you are using a failover data center, you power one off.

You don't have to hard-kill systems to do it. You can simply pull the network; indeed, depending on what you are testing, that might be preferable.
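If you'd rather not physically unplug cables, a network partition can also be faked in software. A minimal Python sketch, assuming Linux, root privileges, and iptables; the target address and outage duration are placeholders:

    #!/usr/bin/env python3
    """Simulate 'pulling the network' to one host without powering
    anything off: drop all traffic to/from it, then restore."""

    import subprocess
    import time

    TARGET = "10.0.0.42"     # hypothetical address of the system under test
    OUTAGE_SECONDS = 300     # how long the simulated partition lasts

    def set_partition(enabled: bool) -> None:
        """Add or delete DROP rules for all traffic to and from TARGET."""
        action = "-A" if enabled else "-D"
        for chain, flag in (("INPUT", "-s"), ("OUTPUT", "-d")):
            subprocess.run(
                ["iptables", action, chain, flag, TARGET, "-j", "DROP"],
                check=True,
            )

    set_partition(True)
    print(f"Traffic to/from {TARGET} dropped for {OUTAGE_SECONDS}s...")
    try:
        time.sleep(OUTAGE_SECONDS)
    finally:
        set_partition(False)  # always restore connectivity, even on Ctrl-C
        print("Partition removed.")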

However, what you will want to do is schedule periodic, regular "outages". My prior team was with a global financial services company. Our systems were 24x7x365 mission critical, yet we were actually required to perform whole-data-center outages - and they would last for days. How often you do it depends on the results. No matter how well you simulate or emulate it, if you don't do it in production it is basically academic.

You will of course want to have it be "all hands on deck". That way, when something goes wrong (and if your systems are complex, it surely will), you are prepared for it. This is another aspect of good business continuity: nothing ever goes according to plan. By periodically causing a disaster scenario you also train your people in how to handle it when certain things go wrong, and you get to add those lessons to the plan. BCP and DR are not static. I'd recommend at a bare minimum an annual full-system test, and preferably 3-4 times per year. You can, and should, schedule them at "low tide" - a time when your systems are traditionally in a low-usage period. For many this means holidays. A three-day weekend, for example, is a reasonable time.

Not all pieces need to be tested at once. Some things can be tested without causing an outage at all. For example, you can test your backup and restore process by periodically restoring the data.
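A restore test like that is easy to automate. Here is a minimal Python sketch that unpacks the latest archive into a scratch directory and compares file hashes against a stored manifest; the paths and the JSON manifest format are assumptions for illustration:

    #!/usr/bin/env python3
    """Verify a backup actually restores: extract it somewhere safe and
    check every file's hash against a previously recorded manifest."""

    import hashlib
    import json
    import tarfile
    from pathlib import Path

    BACKUP = Path("/backups/latest.tar.gz")      # hypothetical archive
    MANIFEST = Path("/backups/latest.manifest")  # JSON: {"relative/path": "sha256hex"}
    SCRATCH = Path("/tmp/restore-test")

    def sha256(path: Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest()

    SCRATCH.mkdir(parents=True, exist_ok=True)
    with tarfile.open(BACKUP) as tar:
        tar.extractall(SCRATCH)  # restore into scratch, never over live data

    expected = json.loads(MANIFEST.read_text())
    bad = [rel for rel, digest in expected.items()
           if sha256(SCRATCH / rel) != digest]

    print("Restore verified OK" if not bad else f"Mismatched files: {bad}")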

As far as convincing the powers that be that this is not a bad idea, consider this: if your plan has a hole in it (and they all do), you can either learn about it when you have everybody ready and things can be quickly restored to normal, or learn about it when a true failure happens. Work your way through testing in production the pieces that can be done individually, and use that as a basis to show the powers that be that the plan a) has some solid pieces and b) still needs a simulated-in-production failure of the "entire system" to ensure it really works.

The Real Bill
I just wrote up a blog entry for my office where I made a similar comment. So much of disaster preparedness is training, and familiarity with said training. We spend a lot of time creating procedures, but almost no time practicing them. Most of us don't get to be good at something without a lot of practice.
Chris Kaminski