views:

163

answers:

9

I suspect that all non-trivial software is likely to experience situations where it hits an external problem it cannot work around and thus needs to fail. This might be due to bad configuration, an external server being down, disk full, etc.

In these situations, especially if the software is running in non-interactive mode, I expect that all one can really do is log an error and wait for the admin to read the logs and fix the problem. If someone happens to interact with the software in the meantime, e.g. a request comes in to a server that failed to initialize properly, then perhaps an appropriate hint can be given to check the logs and maybe even the error can be echoed (depending on whether you can tell if they're a technical guy as opposed to a business user). For the moment though let's not think too hard about this part.

My question is, to what extent should the software be responsible for trying to explain the meaning of the fatal error? In general, how much competence/knowledge are you allowed to presume on administrators of the software, and how much should you include troubleshooting information and potential resolution steps when logging fatal errors? Of course if there's something that's unique to the runtime context this should definitely be logged; but lets assume your software needs to talk to Active Directory via LDAP and gets back an error "[LDAP: error code 49 - 80090308: LdapErr: DSID-0C090334, comment: AcceptSecurityContext error, data 525, vece]". Is it reasonable to assume that the maintainers will be able to Google the error code and work out what it means, or should the software try to parse the error code and log that this is caused by an incorrect user DN in the LDAP config?

I don't know if there is a definitive best-practices answer for this, so I'm keen to hear a variety of views.

A: 

I don't think the code should explain the error - the documentation should.

anon
+3  A: 

The answer, as for all broad questions, is "it depends."

If you're looking at a configuration error, then by all means you should try to explain what was wrong (in the logs). If it's an out-of-memory error, there's not much you can do -- and you may not even be able to write a log message.

One thing you said concerns me:

If someone happens to interact with the software in the meantime, e.g. a request comes in to a server that failed to initialize properly, then perhaps an appropriate hint can be given to check the logs

If this is truly a fatal error, the server should not be running, and therefore any incoming request should fail with absolutely no warning or explanation.

kdgregory
+1  A: 

I guess it depends on how much time you have before delivering the software to your customers.

Yes, it would be nice to parse the error and give a more explicit message but, in this day and age, Google is not always very far.

So unless, you have time to create the code to parse errors, I would leave them as is.

David Segonds
+2  A: 

You should at least provide the message from the exception and a stack trace so you can find out where in the code it occurred. If possible, you should also explain what you were attempting to do and what you think may have happened depending on the exception type.

tvanfosson
+2  A: 

The approach I tend to agree with is that you should explain as much as possible if the fatal error is caused by some code in your own responsibility (i.e. not third party). Otherwise if the error is caused "further down", for example at the database level, then the administrators should be passed up the error returned without adding much further information. So if the database server dies, then your connector with throw some exception, and you would log the error code in the exception.

The administrator or support staff should then have sufficient knowledge to resolve the issue with the provided information.

When you do provide too much details on errors which are not caused by your own code you run the risk of having error details NOT matching the cause of the actual error, especially if the error codes stop matching between versions.

Of course, there are exceptions. We have worked with open source libraries that were so poorly documented that we ended up writing wrappers around the libraries just to provide decent logging of what actually is going on.

Just my 2c

Il-Bhima
A: 

IMHO you can never provide too much information in these case.

In the real world it comes down to cost-benefit analysis, though. What's the impact of the error to you, your app, your business, etc. How much time is it worth spending on it.

In a business critical app my first point applies. Everything else is a sliding scale.

Phil Nash
A: 

I think it depends on who is using the application.

If the application is used by tech savvy people then show more technical details, so they will be able to troubleshoot the problem if they want. I've had some users go to great lengths to solve issues. It can be very helpful, especially for issues that are specific to certain configurations.

If your user base is more of the average Joe then technical details will confuse them in most cases. You should show them a simple error message, and try to offer some solutions if possible.

You could also merge the two techniques. Show a simple error message by default and allow the user to view more detailed error information if they want.

You just don't want to overwhelm the user with too much information that they don't understand. It just frustrates and confuses them in the majority of cases.

Dana Holt
A: 

There are two aspects I think all errors and exceptions should have:

1) Enough information in the error to help debug the problem. Stacktrace, class/method name, type of exception etc fall in this category.

2) A human understandable message, ideally clear enough for say Ops team or Sysadmins engineer to know who to call or forward that error message. Typically it is of the form "so and so module failed" or "network call failed" etc. Something that will come as close to you explaining the problem to customer, in non technical jargon.

Now with all the time constraints etc it may not be possible to have both messages programmed in. Then I would go out on a limb and say we should have the second type of error message. Remember, the sysadmin would probably be able to call you and since you helped write the code you can maybe pinpoint the error. But if the customer is on phone asking about the error, the sysadmin better be able to explain the possible cause :)

On a different note, all products need a clear exception/error handling mechanism decided at architecture level. And the exceptions NEED to adhere to that design. There are few things more frustrating than trying to debug an error based on a design only to find out a day later that its a one of a kind error message based on completely different design.

omermuhammed