In an environment with a SQL Server failover cluster or mirror, how do you prefer to handle the errors that occur when a failover happens? It seems like there are two options:

  1. Fail the entire current client request, and let the user retry
  2. Catch the error in your DAL, and retry there

Each approach has its pros and cons. Most shops I've worked with do #1, but many of them also don't enforce strict transactional boundaries, and they seem to me to be leaving themselves open to trouble in the event of a failure. Even so, I'm having trouble talking them into #2, which should also give a better user experience (one catch being the potentially long delay while the failover happens).

Any arguments one way or the other would be appreciated. If you use the second approach, do you have a standard wrapper that simplifies the implementation? Either way, how do you structure your code to avoid problems when the command that failed isn't idempotent?
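
To make #2 concrete, this is roughly the kind of standard wrapper I have in mind. It's only a minimal sketch in Python with pyodbc rather than a real DAL; the retry count, delay, connection string, and the choice to treat only OperationalError as transient are all illustrative assumptions:

```python
import time

import pyodbc

# All of these values are illustrative assumptions, not recommendations.
TRANSIENT_RETRIES = 2        # how many times to re-run the work after a failure
RETRY_DELAY_SECONDS = 10     # long enough to ride out a short failover
CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)                            # placeholder connection string


def run_with_retry(work):
    """Run work(conn) against a fresh connection, retrying a bounded number of
    times when the failure looks like a failover-style disconnect."""
    for attempt in range(TRANSIENT_RETRIES + 1):
        conn = None
        try:
            conn = pyodbc.connect(CONN_STR)
            return work(conn)
        except pyodbc.OperationalError:
            # Connection-level failure, e.g. the server went away mid-failover.
            if attempt == TRANSIENT_RETRIES:
                raise        # give up; surface the error to the caller (option #1)
            time.sleep(RETRY_DELAY_SECONDS)
        finally:
            if conn is not None:
                conn.close()


# Usage: the whole unit of work is passed in, so a retry replays it from scratch.
orders = run_with_retry(lambda conn: conn.cursor().execute(
    "SELECT order_id FROM dbo.Orders WHERE customer_id = ?", 42).fetchall())
```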

A: 

Number 2 could turn into an infinite loop. What if the problem is network related, or the local PC needs rebooting, or something else entirely?

Number 1 is annoying to users, of course.

If you only allow access via a web site, then you'll never see the error anyway unless the failover happens mid-call. For us, this is unlikely and we have failed over without end users realising.

In real life you may not have a nice clean DAL on a web server. You may have an Excel sheet connecting directly (as most financial shops do) or a WinForms app that keeps the connection open, so you only have the one option.

Failover should only take a few seconds anyway. If the DB recovery takes longer than that, you have bigger issues. And if it happens often enough that you have to think about handling it, well...

In summary, it will happen so rarely that you'll want to know about it, and number 1 would be better. IMHO.

gbn
Can't you avoid an infinite loop with a retry counter? If your failovers only take a few seconds, you're lucky. Most SQL Server cluster-based systems I've worked with take at least 30 seconds to fully roll everything back, plus additional delays while the cache on the standby server fills--all told it can be as long as 2 minutes. Mirrors are only a few seconds, but most shops I work with aren't using them (yet).
RickNZ
By a few, I do mean 20-30 seconds: users don't notice. What would be a valid retry count? Does it handle timeouts and deadlocks? Will it retry only on forcible disconnects? Etc., etc.
gbn
I normally use a retry count of 2, with a delay in between. The idea isn't to prevent the user from ever seeing an error, just to minimize the chance of it happening. How do you make sure that inserts which succeeded earlier in the same page request don't get issued a second time when the failed command is retried? Anything more than just careful transaction design?
RickNZ
We use TXNs to make each single call completely atomic.
gbn
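
To make that last point concrete: if every DAL call is one atomic transaction, a failed call leaves nothing behind, so replaying it through a retry wrapper can't re-issue inserts that already committed. Here is a rough sketch in the same Python/pyodbc style as above; the table, column, and function names are made up for illustration:

```python
import pyodbc


def place_order(conn, customer_id, lines):
    # One DAL call == one transaction: either every insert commits or none do,
    # so replaying the call after a failover can't re-apply half-finished work.
    cursor = conn.cursor()
    try:
        cursor.execute(
            "INSERT INTO dbo.Orders (customer_id) OUTPUT INSERTED.order_id VALUES (?)",
            customer_id)
        order_id = cursor.fetchone()[0]
        for product_id, qty in lines:
            cursor.execute(
                "INSERT INTO dbo.OrderLines (order_id, product_id, qty) VALUES (?, ?, ?)",
                order_id, product_id, qty)
        conn.commit()        # the whole call becomes visible at once
        return order_id
    except Exception:
        try:
            conn.rollback()  # nothing persists, so a retry starts from a clean slate
        except pyodbc.Error:
            pass             # connection already gone; there is nothing to roll back
        raise


# Combined with the retry wrapper above, each attempt is all-or-nothing:
# new_order_id = run_with_retry(lambda conn: place_order(conn, 42, [(7, 1), (9, 3)]))
```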