Let's say you have a .NET system that needs to send out email notifications to a system administrator when there's an error. Example:

try
{
    //do something mission critical 
}
catch(Exception ex)
{
    //send ex to the system administrator
    //give the customer a user-friendly explanation
} 

This block of code gets called hundreds of times a second by different users.

Now let's say an underlying API/service/database goes down. This code is going to fail many, many times. The poor administrator is going to wake up to a few million emails in their inbox and the developer is going to get a rude phone call, not that such an incident (cough) necessarily occurred this morning.

It's pretty clear that this is not a design that scales well.

The first few solutions that come to mind are all flawed in some way:

  • Log errors to the database, then expose high error counts through an HTTP Health Check to an external monitoring service such as Pingdom. (My favourite candidate so far. But what if the database goes down?)
  • Have a static cache that keeps track of recent exceptions, and have the alert system always check it for duplicates first. (Seems unnecessarily complex, and a lot of error messages differ very slightly, e.g. if there is a timestamp in the error, the duplicate check is useless.)
  • Programmatically take our system offline after certain errors or based on constant monitoring of critical dependencies (Risky! What if there's a transient false positive?)
  • Just not alert on those errors, and rely on a different part of the system to monitor and report on the dependencies. (Doesn't cater for the 'unexpected' errors that we haven't anticipated.)
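
To make the first option a bit more concrete, here is a minimal sketch of a sliding-window error counter that a health-check endpoint could read. It is a variation, not exactly what the bullet describes: it keeps the counts in memory rather than in the database, so it still works when the database is the thing that is down. The class name `RecentErrorCounter` and its members are made up for illustration, and the counts are per process, so a web farm would need one counter per node or a shared store.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not from any library): counts errors seen
// within the last `window` of time, in memory, per process.
public sealed class RecentErrorCounter
{
    private readonly TimeSpan _window;
    private readonly Queue<DateTime> _timestamps = new Queue<DateTime>();
    private readonly object _gate = new object();

    public RecentErrorCounter(TimeSpan window)
    {
        _window = window;
    }

    // Call this from the catch block instead of sending an email.
    public void Record(DateTime utcNow)
    {
        lock (_gate)
        {
            _timestamps.Enqueue(utcNow);
            Trim(utcNow);
        }
    }

    // Number of errors recorded within the window ending at utcNow.
    public int Count(DateTime utcNow)
    {
        lock (_gate)
        {
            Trim(utcNow);
            return _timestamps.Count;
        }
    }

    // Drops entries older than the window. Caller must hold the lock.
    private void Trim(DateTime utcNow)
    {
        while (_timestamps.Count > 0 && utcNow - _timestamps.Peek() > _window)
        {
            _timestamps.Dequeue();
        }
    }
}
```

A health-check page could then return an error status whenever `Count(DateTime.UtcNow)` exceeds some threshold, and the external monitor would send one alert for the outage instead of one per exception. The threshold and window are assumptions you would have to tune.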

This seems like a problem that must already have been solved, and it feels like we're going about it in a silly way. Suggestions appreciated, even if they involve a completely different exception management strategy!

+1  A: 

The simplest solution that springs to mind is to assign this exception block an ID number (like, 1) and log the time of the last notification to the administrator. If the elapsed time between notifications is not large enough (say, an hour), don't notify the admin again.

If this piece of code typically generates more than one kind of exception, you may want to log the class of the exception also; if the elapsed time between notifications for the same exception is not large enough, don't notify the admin again.
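
A rough sketch of that idea, assuming the last-notification times can live in memory (they could just as well be written to a file or table). `AlertThrottle` and `ShouldNotify` are hypothetical names, not an existing API. The key combines the block ID with the exception class, and the clock is injectable only so the behaviour can be checked without waiting an hour.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: remembers when each (block ID, exception class)
// pair last triggered a notification and suppresses repeats that
// arrive before minInterval has elapsed.
public sealed class AlertThrottle
{
    private readonly TimeSpan _minInterval;
    private readonly Func<DateTime> _utcNow;
    private readonly Dictionary<string, DateTime> _lastSent =
        new Dictionary<string, DateTime>();
    private readonly object _gate = new object();

    public AlertThrottle(TimeSpan minInterval)
        : this(minInterval, () => DateTime.UtcNow)
    {
    }

    // The clock is a parameter so the throttle can be exercised in tests.
    public AlertThrottle(TimeSpan minInterval, Func<DateTime> utcNow)
    {
        _minInterval = minInterval;
        _utcNow = utcNow;
    }

    // Returns true if the admin should be notified about this exception now.
    public bool ShouldNotify(int blockId, Exception ex)
    {
        string key = blockId + ":" + ex.GetType().FullName;
        DateTime now = _utcNow();

        // The lock matters: the catch block may run on many threads at once.
        lock (_gate)
        {
            DateTime last;
            if (_lastSent.TryGetValue(key, out last) && now - last < _minInterval)
            {
                return false; // notified recently for this block and class
            }
            _lastSent[key] = now;
            return true;
        }
    }
}
```

In the catch block from the question this would look like `if (throttle.ShouldNotify(1, ex)) { /* send ex to the system administrator */ }`, with `throttle` held in a static field. Because the key uses the exception class rather than the message text, a timestamp inside the message does not defeat the check.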

Steven A. Lowe
A: 

I've built monitoring apps that email admins before, and I'll sheepishly admit that I've been in your situation. The solution is to rate-limit your emails. Save the time of the last email sent somewhere, and build in a check to see if a minimum amount of time has passed since the last email before sending one (say, 10 minutes, or longer, up to you). That way the maximum number of emails your poor admin will get is <time issue has been going on> / <period>. In my previous sysadmin job this balanced our need to know that an issue was still going on with the need to have an email box not bursting with 1000 emails an hour.
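
To put numbers on that bound, here is a small simulation, assuming one failure per second for the whole outage and a fixed minimum gap between emails. `RateLimitBound.EmailsSent` is a made-up name used only for this illustration.

```csharp
using System;

public static class RateLimitBound
{
    // Simulates one failure per second for outageSeconds and counts how
    // many emails go out when at least periodSeconds must pass between them.
    public static int EmailsSent(int outageSeconds, int periodSeconds)
    {
        int sent = 0;
        int? lastSentAt = null; // second at which the last email went out

        for (int t = 0; t < outageSeconds; t++)
        {
            bool enoughTimePassed =
                lastSentAt == null || t - lastSentAt.Value >= periodSeconds;

            if (enoughTimePassed)
            {
                sent++;
                lastSentAt = t;
            }
        }
        return sent;
    }
}
```

With a 10-minute period, a 3-hour outage gives `EmailsSent(10800, 600) == 18`, which is 10800 / 600, compared with 10,800 emails without the limit. When the outage length is not an exact multiple of the period the count rounds up, e.g. `EmailsSent(1000, 600) == 2`.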

Aphex