What should be included in the state-of-the-art error and exception handling strategy?

+14 Q:

I understand that this is a very broad question, but a short “it depends” kind of answer will not be accepted. Strategies are born to deal with broad issues.

What issues should an application designer take into consideration when devising the error and exception handling strategy?
How will the strategy differ depending on the software type (COTS, in-house business app, consultingware, game, hosted web app, embedded, etc.)? Is the software type important?
Ethical, political and legal issues?
Various perspectives on error handling (user, developer, business support, management).

Some ideas that I would have explored:

Various error reporting routes (i.e. UI, logging, automatic admin notification).
Defence in depth and robustness (failover contingency and fail-safe mechanisms, recovery against problems that are not yet known).
Treating users and customers fairly (i.e. minimising the impact on software users and other people serviced by software).

I'm looking for a similar list of ideas and concepts.

Please use comments to let me know if I need to clarify the question further, and thanks to everyone contributing!

FAQ

Development platform (Java, .NET, mobile): it will definitely have some effect on the implementation details of the strategy from a developer's perspective, but less so from the users' point of view.

April Fools' Day it is certainly not. Most legacy systems I was asked to work on did not have a clear error handling strategy.

Could this be made a community wiki? No. It seems like a good question, and good questions are hard to come up with.

What do you mean by the strategy? A long-term plan that gives direction and focus, and brings consistency and coordination to error and exception handling. For a larger team working on the software, the strategy can be formalised and distributed in written form.

It seems to be a duplicate question (see "Best practices for exception management in Java or C" and "Which and why do you prefer exceptions or return codes"). Those questions deal with one particular perspective on error handling (mostly the developer's). I'd like to learn more about other perspectives and how they contribute to the overall strategy.

+5 A:

There are so many possible answers here, but I'll take a crack at it.

What issues should an application designer take into consideration when devising the error and exception handling strategy?

When you have multiple developers, it should be easy to "hook into" your error handling framework, otherwise people won't use it.
Use transactions wisely to maintain data consistency. I see apps all the time where a failure could occur halfway through a process and cause weird data inconsistencies because the entire operation was not rolled back properly.
Consider criticality when you handle exceptions. For example, suppose you have an online ordering system, and part of the workflow is to e-mail the site owner to let them know that a new order was placed. If sending that e-mail were to fail, should the user get an error and the whole order be cancelled?
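
To make the last two points concrete, here is a minimal Python sketch, not a definitive implementation. It assumes a SQLite database with hypothetical `orders` and `order_items` tables; `place_order` and `notify_owner` are names invented for the illustration. The order rows are written in one transaction, while the e-mail is treated as non-critical, so its failure is logged but never cancels the order.

```python
import logging
import sqlite3

log = logging.getLogger("orders")


def place_order(conn, customer, items, notify_owner):
    """Insert an order atomically, then attempt a best-effort notification.

    `conn` is a sqlite3 connection; `notify_owner` is a hypothetical callable
    that e-mails the site owner and may raise.
    """
    # `with conn` commits if the block succeeds and rolls back if any
    # statement raises, so a failure halfway through leaves no partial order.
    with conn:
        cur = conn.execute("INSERT INTO orders (customer) VALUES (?)", (customer,))
        order_id = cur.lastrowid
        for sku, qty in items:
            conn.execute(
                "INSERT INTO order_items (order_id, sku, qty) VALUES (?, ?, ?)",
                (order_id, sku, qty),
            )

    # The e-mail is not critical to the order: log the failure and carry on.
    try:
        notify_owner(order_id)
    except Exception:
        log.exception("owner notification failed for order %s", order_id)
    return order_id
```

Whether a failed notification should instead be queued for retry is a business decision; the point is only that the decision is made explicitly rather than by accident.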

How will the strategy differ depending on the software type (COTS, in-house business app, consultingware, game, hosted web app, embedded, etc.)? Is the software type important?

For desktop or embedded apps, recording information about the environment (OS version, hardware, other apps running, etc.) can be very useful when investigating error reports.
For enterprise apps and web apps, things like e-mail error notifications, SMS messaging and integration with ECO tools (e.g. Tivoli) become very useful.

Ethical, political and legal issues?

The only thing I can think of here would be for desktop apps - "phone home" type applications are generally frowned upon, especially if they submit information about the user's machine that could be sensitive.

Various perspectives on error handling (user, developer, business support, management).

From a user perspective, try to avoid errors by designing the interface in such a way that it is difficult for them to make mistakes. Don't ask questions that the user probably won't be able to answer (Abort, Retry, Fail anyone?)
From a developer perspective, you'll want as much information as possible to help diagnose what happened - stack trace, environment info, etc.
From a business support & management standpoint, they'll want to know what to do about the error (mostly in an enterprise environment) - who is responsible for the application (who do I call/page/etc?) as well as the criticality and any possible side effects (e.g. if this batch job fails, what business processes will that affect?). Written documentation is your friend here.

Eric Petroelje 2009-03-31 16:00:07

+3 A:

I'm coming from a Java background, but my response should apply to .Net, as well.

Rules of thumb:

1. Write your code to fail fast: Hunt & Thomas, Tip 33.
2. Test all of your parameters with a param check library. Bad arguments are not exceptional conditions; they are misuse of the (documented) API. Example: Google Collections Predicates.
3. Use Exceptions for exceptional conditions: Hunt & Thomas, Tip 34. Exceptions should NOT be used as return codes.
4. Test for exceptional conditions: Exceptions are postconditions for method invocations. If you can't get there with a test, the Exception shouldn't be declared.
5. (For Java) Follow Josh Bloch's advice (all of Chapter 9). Some important tips:
   5a. Throw exceptions appropriate to the abstraction.
   5b. Strive for failure atomicity.
   5c. Include failure-capture information in the detail message (or encapsulate it in the Exception itself).
   5d. Don't ignore Exceptions.
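
A few of these rules can be sketched together. The example below is in Python rather than Java, purely as an illustration; `AccountRepository` and `AccountNotFound` are hypothetical names. It rejects bad arguments up front (fail fast), translates a low-level error into one that fits the abstraction, and keeps failure-capture information on the exception itself.

```python
class AccountNotFound(LookupError):
    """Raised by the repository when no account has the requested id."""

    def __init__(self, account_id):
        super().__init__(f"no account with id {account_id!r}")
        self.account_id = account_id  # failure-capture information


class AccountRepository:
    """Hypothetical repository backed by an in-memory mapping."""

    def __init__(self, rows):
        self._rows = dict(rows)

    def balance(self, account_id):
        # Fail fast: a bad argument is API misuse, so reject it before
        # doing any work.
        if not isinstance(account_id, str) or not account_id:
            raise ValueError(
                f"account_id must be a non-empty string, got {account_id!r}"
            )
        try:
            return self._rows[account_id]
        except KeyError as exc:
            # Translate the storage-level error into one that fits this
            # abstraction, keeping the original as the cause.
            raise AccountNotFound(account_id) from exc
```

Callers can then catch `AccountNotFound` without knowing that the storage happens to be a dictionary, and the original `KeyError` is still available as `__cause__` for diagnosis.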

jasonnerothin 2009-04-06 18:23:54

+3 A:

I ran across some of these issues at work - didn't really have a chance to explore it there though. My thoughts:

What issues should an application designer take into consideration when devising the error and exception handling strategy?

The ideal exception handling strategy would be complete recovery plus logging of the error. The catch-22: if you could do such a thing, wouldn't you have written it into the code in the first place? At that point it's not really an "exception" per se, and your implementation complexity goes exponential. The other side of this would be in the realm of autonomic systems and the "self-healing software" approach. I believe the most realistic strategy is to always try to force the system into a consistent state (i.e. minimal damage). You will always be forced to trade off something (lost or corrupted data, loss of resources resulting in reduced performance, etc.); however, being in a consistent state increases your chance of staying operational at diminished capacity rather than facing a total shutdown. Formalizing a consistent state among the project team could mean establishing natural default values which would be used as a reset state.

How will the strategy differ depending on the software type (COTS, in-house business app, consultingware, game, hosted web app, embedded, etc.)? Is the software type important?

I think each type of software lends itself to different auditing and QoS requirements, and it is reflected in the costs associated with downtime and / or data corruption; however, the general strategy is the same. With embedded, the strategy is to minimize the appearance of the problem to the user and create logs. You can achieve this by restarting the software quietly (i.e. reset the state). With hosted web apps, the session data from a crash can be dumped for later analysis and the user gets a new session. For a game (especially for things like MMORPG), you invest in maintaining snapshot data to prevent gamers from losing progress in the event of a server failure. Server clustering and fail-over techniques are also very important in these implementations.

Ethical, political and legal issues?

Transparency is probably the most important part of error and exception handling, and it comes in the form of maintaining audit records. The end goal is to be able to demonstrate that a system failure (should any collateral damage ensue) resulted from an unpredictable chain of events that the designers could not reasonably have foreseen. It's also important to demonstrate that whatever handling mechanisms were in place had a positive effect by reducing damages, etc. Keeping users in the loop in the face of a catastrophic failure is also important (i.e. "Where did my WoW server go????"), but my main point is that transparency should be applied to disciplined auditing for the purposes of reconstructing the failure.

Various perspectives on error handling (user, developer, business support, management).

As a user, error handling should be totally invisible. If a server crashes, I still want my bank transaction to be completed as scheduled without having to call the bank and rerun the transaction.

As a developer, error handling is the most difficult part of the application to design. Enumerating the things which can go wrong, resulting from both people and technology factors, and classifying them into cases which we can write code to handle is immensely difficult. We depend on the project budget and management to guide these decisions, but in the end, it's still like playing a game of Russian Roulette.

For business support & management, I suppose error handling would be like insurance paid during the software development phases, which reduces the incidence of having to compensate customers who experience inconveniences or outages due to software failure. It's also a measure of software quality and accountability (i.e. they want to know which division / group / developer was responsible).

slau 2009-04-10 04:18:26

+3 A:

It is important to get as much information as possible about errors that are occurring back to the development team. Log files are good in cases where there are no users to experience the error condition and you can be certain that someone is checking the log file. Automatic email is great for server based applications. Alert messages are problematic because users never read them. One trick that's worked for me is to copy a detailed error trace on to the clipboard while a user friendly error is displayed, then train users to paste the error trace into an email error report. The web equivalent is to display a friendly message while sending a detailed error in an email to the development team from the server.

There should be a log of last resort, in other words, what happens when writing to the log file causes an error? There should also be built in protection against "sorcerer's apprentice" type problems in which error handling itself locks the system up. On desktop systems, sloppy error handling code can result in a never ending cascade of message boxes that leave no option but to kill the app, possibly losing data in the process. Similar problems can result if error handling code triggers exceptions. The error handling framework should detect error handling errors and stop reporting errors if there is no better option.
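
As a rough sketch of what such protection could look like (in Python; `ErrorReporter` and its parameters are hypothetical names, not part of any library): the reporter refuses to recurse if reporting itself triggers a report, stops trying a primary sink that keeps failing, and always falls back to a log of last resort.

```python
import sys
import threading


class ErrorReporter:
    """Reports errors through a primary sink, with a log of last resort."""

    def __init__(self, primary_sink, last_resort=None, max_sink_failures=3):
        self._primary = primary_sink
        self._last_resort = last_resort or (
            lambda msg: print(msg, file=sys.stderr)
        )
        self._max_failures = max_sink_failures
        self._failures = 0  # consecutive primary-sink failures
        self._local = threading.local()

    def report(self, message):
        # Re-entrancy guard: if reporting itself triggers a report on this
        # thread, or the primary sink has failed too often, skip it and go
        # straight to the last resort instead of recursing or retrying.
        busy = getattr(self._local, "busy", False)
        if busy or self._failures >= self._max_failures:
            self._write_last_resort(message)
            return
        self._local.busy = True
        try:
            self._primary(message)
            self._failures = 0
        except Exception as exc:
            self._failures += 1
            self._write_last_resort(f"{message} (primary sink failed: {exc!r})")
        finally:
            self._local.busy = False

    def _write_last_resort(self, message):
        try:
            self._last_resort(message)
        except Exception:
            pass  # nothing better is left; give up quietly, don't cascade
```

The failure counter here is not protected by a lock, so treat this as an outline of the idea rather than production code.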

For vital batch processes, nothing beats a proactive notification of success. If the "batch complete" email doesn't arrive, the user knows something's up, even if the error handling is fubar.

Exceptions should be caught at boundaries. All event handlers, public component functions, and service methods should catch all exceptions that occur. In some cases, re-throwing an exception makes sense; for example, when an exception is caught in a web service method, a SOAP exception should be thrown. But it is a bad idea to allow an exception to percolate across a component boundary automatically.
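
One way to express a boundary like this is a wrapper applied to every public entry point. The Python sketch below is illustrative only; `boundary`, `ServiceFault` and `get_quote` are hypothetical names, with `ServiceFault` standing in for whatever the boundary's own fault type is (a SOAP fault in the web service case).

```python
import functools
import logging

log = logging.getLogger("service")


class ServiceFault(Exception):
    """The only exception type callers of this component should see."""


def boundary(func):
    """Catch everything at a component boundary and translate it."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except ServiceFault:
            raise  # already in the boundary's vocabulary
        except Exception as exc:
            # Full detail goes to the log; the caller gets a stable fault.
            log.exception("unhandled error in %s", func.__name__)
            raise ServiceFault(f"{func.__name__} failed") from exc

    return wrapper


@boundary
def get_quote(symbol, prices):
    return prices[symbol]  # may raise KeyError internally
```

Even a trivial function like `get_quote` gets the wrapper, so the boundary still holds when the body is changed later.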

Conversely, it is usually a bad idea to catch exceptions in private methods of classes, or in methods that are nested in the middle of a complex internal process of a component. It doesn't make sense to handle an exception in this context unless you can recover from it. This internal code must be structured so that all resources will be released and database transactions rolled back in the presence of exceptions. Catch blocks in every method are a sign of chaos; using and finally blocks are a sign of a sound error handling framework.
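
In Python, the closest equivalents to using and finally blocks are the `with` statement and `try`/`finally`. The sketch below assumes a hypothetical connection object exposing `commit`, `rollback`, `close`, `debit` and `credit`; the internal step has no catch block at all, and cleanup is guaranteed by the surrounding context manager.

```python
from contextlib import contextmanager


@contextmanager
def unit_of_work(conn):
    """Commit if the block succeeds, roll back if it raises, always close."""
    try:
        yield conn
        conn.commit()
    except BaseException:
        conn.rollback()
        raise  # not handled here; the boundary decides what to do
    finally:
        conn.close()


def transfer(conn, src, dst, amount):
    # No catch block: this internal step cannot recover from a failure,
    # so it simply lets the exception propagate.
    conn.debit(src, amount)
    conn.credit(dst, amount)
```

Usage would look like `with unit_of_work(conn) as c: transfer(c, "a", "b", 10)`: if `credit` raises, the debit is rolled back and the connection is closed before the exception reaches the boundary.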

Remember that exceptions are exceptional (if you were expecting them, they wouldn't be called exceptions!) Rather than trying to anticipate when errors might occur, concentrate on shoring up your component boundaries. Even trivial code that could not possibly experience an error should have a catch block if it sits on a boundary. That way when the code is modified later in unexpected ways the architecture will still hold.

Each component boundary may require a different reporting mechanism. In the case of components that are designed to run in different contexts, provide an error handling interface that client code can use to catch error messages. Don't forget the log of last resort if someone forgets to hook the error handling interface.

To sum up:

Get detailed error information back to the development team reliably.
Trap errors always at component boundaries and only at component boundaries.
Make all code exception safe.
Don't let the error handling framework become part of the problem.

Paul Keister 2009-04-12 06:12:25

Great ideas, however I'm against messing around with user's clipboard.

Totophil 2009-04-13 10:34:13

To be clear, I'm talking about giving them a Paste button, not forcing anything on to the clipboard.

Paul Keister 2009-04-19 17:55:07

+1 A:

I don't intend to win the bounty, but here are some strategies that I have used and that were well received:

Extracting information from sub-components and mapping it to functional units helped our business analysts and end-users to understand the errors better.
Assigning a business priority level will help, depending upon the domain you are operating in.
A separate Error Viewer app helped us view the errors before they were reported, so my teams could start fixing them.
System level exceptions are better when they are not messed with.
Async logging of errors will help a great deal in the overall strategy and design.
Create a domain-driven error strategy, meaning the errors should correspond to the failure of some business logic. Of course, most should be handled by developers, but there are certain scenarios that you may run into if you are working on message routing between various enterprises in trading engines, etc.
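
For the async logging and business priority points, one possible shape (sketched in Python with the standard library's `QueueHandler` and `QueueListener`) is shown below. `start_async_logging`, the `app.errors` logger name and the `priority` field are assumptions made for the example.

```python
import logging
import logging.handlers
import queue


def start_async_logging(target_handler):
    """Route log records through a queue so callers never wait on slow I/O."""
    log_queue = queue.Queue(-1)  # unbounded queue
    listener = logging.handlers.QueueListener(log_queue, target_handler)
    listener.start()  # a background thread drains the queue into the handler

    logger = logging.getLogger("app.errors")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    logger.propagate = False
    return logger, listener
```

A caller would then log with something like `logger.error("routing failed", extra={"priority": "P1"})`, and call `listener.stop()` at shutdown so that queued records are flushed. The `priority` value travels with the record, so a handler or an error viewer can sort by business impact.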

CodeToGlory 2009-04-13 03:38:02
