views:

186

answers:

6

If you had to audit a Java application for worst practices in high availability and disaster recovery, you would probably look for hardcoded IP addresses and suboptimal caching of bind handles. What else should be considered?

+2  A: 

Lack of monitoring facilities. Sooner or later, all applications will fail. When that happens, you'll want to know about it before anyone else does.

Emil H
+2  A: 

Lack of logging. If you can't find what killed your app, it's really hard to fix it. This is particularly nasty with intermittent failures that are hard to reproduce.

McWafflestix
+3  A: 

Lack of action/state logging.

A Java application should be able to resume where it was when it crashed.
That means there should be a mechanism that records what has already been done (so that the next run does not do everything all over again).

That also means such a Java program should always reach the same state after the same set of actions. (Doing something twice should give the same result as doing it once, and actions already done should simply be skipped rather than repeated.)

That record can take many forms (file, database, metadata in a repository of some sort, ...), but the point is: a Java application that wants to recover as fast as possible should know what it has already done.
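A minimal sketch of such a record, assuming a plain text file with one completed step id per line. The class name `StepJournal` and the method `runOnce` are hypothetical, chosen only for illustration; a database table or repository metadata would serve the same purpose.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical journal: remembers which steps finished, one step id per line. */
class StepJournal {
    private final Path file;
    private final Set<String> done = new HashSet<>();

    StepJournal(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            // Reload the steps that earlier runs already completed.
            done.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
        }
    }

    /** Runs the step unless it is already recorded; returns true if it ran. */
    boolean runOnce(String stepId, Runnable step) throws IOException {
        if (done.contains(stepId)) {
            return false; // done in this or a previous run: skip it
        }
        step.run();
        // Record the step only after it succeeded. A crash between run() and
        // this write means the step runs again on restart, so the step itself
        // should still be safe to repeat.
        Files.write(file, Collections.singletonList(stepId), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        done.add(stepId);
        return true;
    }
}
```

On restart, building a new `StepJournal` over the same file and calling `runOnce` with the same ids skips everything that was already recorded.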

VonC
+3  A: 

Since proper monitoring is already mentioned, I would add having a contingency plan in place. It can be something as simple as: if this happens then we do this, if this other thing happens then we do that. Then when problems occur you just follow the (previously tested) plan instead of having everyone panic and make hasty decisions.

Pablote
A: 

The best thing to do is to schedule some downtime and test it. You will find many more problems doing this. Once you have everything documented, get someone else to do it without your help. ;)

Peter Lawrey
A: 

As I see it, there are a couple of key aspects to what you are asking about. I don't think it is language specific, and you used a Java app as an example, so I hope you don't mind me not talking specifically about Java.

Failover/HA: This is where you identify your SPoF - Single Points of Failure. Examples include hardcoded addresses as you mentioned, but also applications that store data in ways that cannot be replicated, such as on a local disk. Other items might be caching DNS lookups for "too long", not re-establishing severed connections, or depending on specific hardware information (such as MAC addresses, CPUIDs, dongles, partition labels, MB or drive serial numbers, etc.). I've seen all of these as problems leading to unnecessary workarounds to get BCP/DR functional.
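In Java terms, two of those items can be sketched roughly as follows. The class `Reconnect` and its methods are hypothetical names for illustration. The `networkaddress.cache.ttl` security property is the standard JVM setting for how long successful DNS lookups are cached, and it should be set before the first lookup. The retry counts and delays are placeholder values, not recommendations.

```java
import java.security.Security;
import java.util.concurrent.Callable;

/** Hypothetical helpers for two of the failover problems listed above. */
class Reconnect {

    /** Caps the JVM's cache of successful DNS lookups, in seconds. */
    static void limitDnsCache(int seconds) {
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(seconds));
    }

    /**
     * Tries to (re)establish a connection, doubling the wait between attempts.
     * The connect callable stands in for whatever opens the real connection.
     */
    static <T> T withRetry(Callable<T> connect, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay); // back off before the next attempt
                    delay *= 2;
                }
            }
        }
        throw last; // every attempt failed: surface the last error
    }
}
```

A caller would wrap the code that opens a socket or database connection in `withRetry`, so that a severed connection is rebuilt (and the host name resolved again) instead of being treated as fatal.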

Data Integrity: How is the data stored? Does it use a custom format/structure? If so, is there a "dump and restore" mechanism? Does the service need to stop servicing clients, or does it degrade its service, to do backups? Does it write data to a device asynchronously, and if so, how often is it "flushed" to disk (sometimes this is up to the app, others not so much)? File locking, memory-to-persistent storage timeframes and capabilities are also part of this.
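For the case where flushing is up to the app, a Java application can ask the operating system to push its writes to the storage device with `FileChannel.force`. The sketch below assumes a simple append-only file of newline-terminated records; the class `DurableAppend` is a hypothetical name. Whether the data truly reaches stable storage still depends on the OS, the filesystem and the hardware.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: appends one record and flushes it before returning. */
class DurableAppend {
    static void append(Path file, String record) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) {
            ByteBuffer buffer =
                    ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8));
            while (buffer.hasRemaining()) {
                channel.write(buffer); // a single write may be partial
            }
            // Ask the OS to flush file content and metadata to the device.
            channel.force(true);
        }
    }
}
```

Flushing on every record trades throughput for a smaller window of lost data; batching several records per `force` call is the usual compromise.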

Essentially, look at what would cause you to have to work around something. Then look at how that came about and you'll probably start developing two important bits of knowledge: Patterns to use to improve BCP/DR, and as you mentioned, AntiPatterns that cause problems. Injecting these types of questions into the development process as early as is feasible will help your developers derive the patterns and anti-patterns you are looking for. Often just asking the questions prevents the problems.

The Real Bill