Let's say you have a .NET system that needs to send out email notifications to a system administrator when there's an error. Example:
    try
    {
        // do something mission critical
    }
    catch (Exception ex)
    {
        // send ex to the system administrator
        // give the customer a user-friendly explanation
    }
This block of code gets called hundreds of times a second by different users.
Now let's say an underlying API/service/database goes down. This code is going to fail many, many times. The poor administrator is going to wake up to a few million e-mails in their inbox and the developer is going to get a rude phone call, not that such an incident (cough) necessarily occurred this morning.
It's pretty clear that this is not a design that scales well.
The first few solutions that come to mind are all flawed in some way:
- Log errors to the database, then expose high error counts through an HTTP Health Check to an external monitoring service such as Pingdom. (My favourite candidate so far. But what if the database goes down?)
- Have a static cache that keeps track of recent exceptions, and have the alert system always check it for duplicates first. (Seems unnecessarily complex, and a lot of error messages differ very slightly: if there is a time-stamp in the error, for example, an exact-match duplicate check is useless.)
- Programmatically take our system offline after certain errors, or based on constant monitoring of critical dependencies. (Risky! What if there's a transient false positive?)
- Just not alert on those errors, and rely on a different part of the system to monitor and report on the dependencies. (Doesn't cater for the 'unexpected' errors that we haven't anticipated.)
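To make the second option concrete, here is roughly what I have in mind. This is only a minimal sketch: `AlertThrottle`, the 15-minute window and the "replace every run of digits" rule are all made up for illustration, and the digit stripping is just one crude way around the time-stamp problem.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class AlertThrottle
{
    // Last time an alert was sent for each normalized error key.
    // Note: this grows without bound; a real version would need eviction.
    private static readonly ConcurrentDictionary<string, DateTime> LastSent =
        new ConcurrentDictionary<string, DateTime>();

    // Assumed quiet period per distinct error.
    private static readonly TimeSpan Window = TimeSpan.FromMinutes(15);

    // Collapse digit runs so time-stamps and ids don't defeat the duplicate check.
    public static string Normalize(Exception ex)
    {
        return ex.GetType().FullName + ":" + Regex.Replace(ex.Message, @"\d+", "#");
    }

    // Returns true if the caller should send an e-mail for this exception at time 'now'.
    public static bool ShouldAlert(Exception ex, DateTime now)
    {
        string key = Normalize(ex);
        while (true)
        {
            DateTime previous;
            if (!LastSent.TryGetValue(key, out previous))
            {
                if (LastSent.TryAdd(key, now)) return true;
                continue; // another thread added the key first; re-read it
            }
            if (now - previous < Window) return false;
            if (LastSent.TryUpdate(key, now, previous)) return true;
            // another thread updated the entry first; loop and re-check
        }
    }
}
```

The catch block would then only send mail when `AlertThrottle.ShouldAlert(ex, DateTime.UtcNow)` returns true. It still has the weaknesses noted above: it is per-process state, and the normalization is guesswork.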
This seems like a problem that has to have been solved already, and it feels like we're going about it in a silly way. Suggestions appreciated, even if they involve a completely different exception management strategy!