Note: This is not for unit testing or integration testing. This is for when the application is running.

I am working on a system which communicates with multiple back-end systems, which can be grouped into three types:

  • Relational database
  • SOAP or WCF service
  • File system (network share)

Due to the environment this will run in, there are no guarantees that any of those will be available at run time. In fact some of them seem pretty brittle and go down multiple times a day :(

The thinking is to have a small bit of test code which runs before the actual code. If there is a problem, persist the request and poll the target system until it becomes available. The tests could also be rerun at logical points within the code to check the system is still available. The ultimate goal is a very stable system, regardless of the stability (or lack thereof) of the systems it communicates with.
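Very roughly, the shape I have in mind is something like this (all of the names here are illustrative, not an existing API):

```csharp
// Rough sketch only - the interfaces and names are illustrative, not final.
using System;

public interface IBackendCheck
{
    bool IsAvailable();              // cheap probe: open a connection, ping the service, touch the share
}

public interface IRequestStore
{
    void Persist(object request);    // keep the request until the backend comes back
}

public class GuardedExecutor
{
    private readonly IBackendCheck check;
    private readonly IRequestStore store;

    public GuardedExecutor(IBackendCheck check, IRequestStore store)
    {
        this.check = check;
        this.store = store;
    }

    public void Execute(object request, Action<object> work)
    {
        if (!check.IsAvailable())
        {
            store.Persist(request);  // a background poller would replay this once the backend is back
            return;
        }

        try
        {
            work(request);
        }
        catch (Exception)            // the backend can still fail between the check and the call
        {
            store.Persist(request);
        }
    }
}
```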

My questions around this design are:

  1. Are there major issues with it? (small things like the fact it may fail between the test completing and the code running are understandable)
  2. Are there better ways to implement this sort of design?
  3. Would using traditional exception handling and/or transactions be better?

Updates

  • The system needs to talk to the back end systems in a coordinated way.
  • The system is very async in nature so using things like queuing technologies is fine.
  • The system must run even if one or more backend systems are down as others may be up and processing of some information is possible.
+1  A: 

The Microsoft Smart Client framework provides a ConnectionMonitor class. It should be easy to use or duplicate.

leppie
It's not exactly what our current thinking is, since we would like to say "at point X, do a test", whereas the ConnectionMonitor looks like it would tell us when availability changes (please correct me - this is the first time I've heard of it). However, I do like it and am thinking of ways it could help in other areas.
Robert MacLean
I see there is an UpdateStatus method which looks like it would work for our needs. How well would this fit with something that is not network connectivity, like a database? And taking it further, checking whether the database is in a usable state (permissions, tables exist, etc.)?
Robert MacLean
We only use it for HTTP, but I believe it can be extended. I have not gone down that route yet.
leppie
+2  A: 

You will need traditional exception handling no matter what, since, as you point out, there is always a chance that things will fail between your last check and the actual request. So any solution you find should interact smoothly with that handling.

You do not say whether these flaky resources need to interact in a coordinated manner. If they do, you should probably be using a transaction manager of some sort; I do not believe you want to get into the footwork of transaction management in application code for most needs.

I have also seen people use AOP to encapsulate retry logic for calls to back-end systems that fail intermittently (for instance due to time-outs). Used sparingly, this can be a decent solution.
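Even without a full AOP framework you can pull the retry concern out into one place. A minimal sketch along these lines (the names, attempt count and delay are only illustrative):

```csharp
using System;
using System.Threading;

public static class Retry
{
    // Runs the call, retrying a few times before letting the exception escape.
    public static T Execute<T>(Func<T> call, int maxAttempts, TimeSpan delay)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return call();
            }
            catch (Exception)
            {
                if (attempt >= maxAttempts)
                    throw;               // out of attempts: let normal exception handling take over
                Thread.Sleep(delay);     // back off before trying again
            }
        }
    }
}

// Usage (illustrative): var customer = Retry.Execute(() => customerService.GetCustomer(id), 3, TimeSpan.FromSeconds(5));
```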

In some cases you can also use message queuing technology to alleviate unstable back-ends. You could, for instance, commit to a message queue as part of a transaction, and only take the message off the queue once it has been processed successfully. But this design is normally only possible when you can live with an asynchronous process.
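On the Microsoft stack, one way to sketch this is MSMQ transactional queues combined with a TransactionScope (the queue path and message shape here are placeholders):

```csharp
using System;
using System.Messaging;
using System.Transactions;

public class QueueGateway
{
    // The queue must have been created as a transactional queue for this to work.
    private readonly MessageQueue queue;

    public QueueGateway()
    {
        queue = new MessageQueue(@".\private$\backendRequests");
        queue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });
    }

    public void Enqueue(string requestBody)
    {
        using (var scope = new TransactionScope())
        {
            // The send only becomes visible if the surrounding transaction commits.
            queue.Send(requestBody, MessageQueueTransactionType.Automatic);
            scope.Complete();
        }
    }

    public void ProcessNext(Action<string> handler)
    {
        using (var scope = new TransactionScope())
        {
            Message message = queue.Receive(MessageQueueTransactionType.Automatic);
            handler((string)message.Body);  // if this throws, the receive rolls back and the message stays queued
            scope.Complete();
        }
    }
}
```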

And as always, real stability can only be achieved by attacking the root cause of the problem. I once had a 25-year-old bug in a mainframe TCP/IP stack fixed because we were overrunning it, so it is possible.

krosenvold
The systems do need to interact in a coordinated manner, but we have taken care of the transaction manager functionality already. We are also using a queuing technology heavily (although not in the way you wrote about), as the system has to be async.
Robert MacLean
A: 

Our approach to this kind of issue was to run a really basic 'sanity tester' prior to bringing up our main application. This was a thick client, so we could run the test every time the app started. The sanity test would go out and check things like database availability and external network (extranet) access, and it could have been extended to cover web services as well.

If there was a failure, the user was informed, and crucially an email was also sent to the support/dev team. These emails soon became unwieldy because so many were being created, but we then set up filters so we knew when something really bad was happening. Overall the approach worked pretty well. Our biggest win was being able to tell users that the system was down before they had entered data and got part way through a long-winded process. They absolutely loved it.

At a technical level the sanity tester was written in C#, and it used exception handling in a conventional way to detect the problems it was looking for. The sanity program became a mini app in its own right, standalone from the main app. If I were doing it again I'd use a logging framework to capture issues, which is more flexible than our hard-coded approach.
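The individual checks were nothing fancy; roughly along these lines (connection strings, paths and URLs are obviously placeholders):

```csharp
using System;
using System.Data.SqlClient;
using System.IO;
using System.Net;

public static class SanityChecks
{
    // Each check answers one question: can we reach this dependency right now?
    public static bool DatabaseIsUp(string connectionString)
    {
        try
        {
            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();
                return true;
            }
        }
        catch (Exception) { return false; }
    }

    public static bool ShareIsUp(string path)
    {
        try { return Directory.Exists(path); }
        catch (Exception) { return false; }
    }

    public static bool ServiceIsUp(string url)
    {
        try
        {
            var request = WebRequest.Create(url);
            request.Timeout = 5000;                 // fail fast rather than hang the check
            using (request.GetResponse()) { return true; }
        }
        catch (Exception) { return false; }
    }
}
```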

MrTelly
@MrTelly the problem is not prior to launch of the application; it is while the app is running. The reason pre-launch checks wouldn't work well for us is that not all routines use the same systems, so stopping all processing because one routine's backend is down wouldn't be effective, or even an option.
Robert MacLean