Our team has a number of processes which we start manually but which may run for many days. The processes do various things to large numbers of entities (web pages, database rows, images, files, etc). Obviously there are failures from time to time, and we have to design our processes to handle these failures gracefully and move on so the whole job is not brought down.

Depending on the particular process in question, the rate, severity and urgency of failures vary. In some cases we send emails when a rare but important error happens, in other cases we just log it and move on, and so on.

The problem is that we have different error handling code scattered everywhere and more often than not when we "log it and move on" no one ever goes back and reads the logs, so no one ever knows what problems occurred. We can't default to email for all problems because there would simply be too many emails.

These are long-running processes but not daemons, where something like SNMP or Nagios might feel like a good fit. Surely this is a fairly common problem, but I cannot seem to find many solutions online. I've heard people talking about using log4j (or other similar logging packages) to log to a database, which seems like it might be a step in the right direction, but surely there are more sophisticated solutions out there by now? I'm imagining something where your logger writes events to a database and there's a Nagios-like web interface that lets you see what errors are happening with what processes in real time, as well as configure email alerts for specific patterns, etc.

Does something like this exist? If not, what approaches have you used to successfully deal with similar issues?

(For what it's worth, most of our codebase is in Python, but I would imagine any decent implementations of this idea are largely non-language-specific, and obviously any conceptual solutions would be as well.)

Update: I just spent some time looking at Chainsaw, which is kind of what I am looking for, but I'd like it to be a webapp instead of a desktop app, and have alerting functionality.

Update: I just discovered hoptoadapp and exceptional which are both somewhat along the lines of what I was thinking, though both target Rails specifically.

+1  A: 

Well, it seems like a workable solution would be to digest the error logs. Every night, have a process go through the error logs, roll up the errors, warnings, etc. for the day, and put those into an email. You could even group them by severity and/or application if you so desired.

In the end you get just one email a day with all the info right there at your fingertips. It is not a "quick" or even elegant solution, but it could be very workable in the long run.
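As a rough illustration of the nightly roll-up, here is a minimal sketch in Python. The log line format, the `build_digest` name, and the severity names are assumptions made for the example, not something from your setup; the resulting text would become the body of the daily email (for instance via `smtplib`).

```python
import re
from collections import Counter, defaultdict

# Assumed line format (hypothetical): "2024-01-01 12:00:00 ERROR crawler: message text"
LINE_RE = re.compile(r"^\S+ \S+ (?P<level>[A-Z]+) (?P<app>[^:]+): (?P<msg>.*)$")


def build_digest(lines, levels=("ERROR", "WARNING")):
    """Group log lines by severity, then count identical (app, message) pairs."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") in levels:
            key = (match.group("app"), match.group("msg"))
            counts[match.group("level")][key] += 1

    parts = []
    for level in levels:
        if not counts[level]:
            continue  # nothing logged at this severity today
        parts.append(f"{level} ({sum(counts[level].values())} total)")
        # Most frequent problems first, so the worst offenders are at the top.
        for (app, msg), n in counts[level].most_common():
            parts.append(f"  {n:>4}x [{app}] {msg}")
    return "\n".join(parts)
```

Counting identical messages keeps the email short even when one failure repeats thousands of times during a multi-day run.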

On its own this doesn't give you anything real-time, but you could grow it into a more real-time solution. It wouldn't be that hard to write a process that monitors log files for changes and then fires off some rules based on the latest error messages. It is the parsing that gets tricky. ;) Good luck.
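A minimal sketch of that monitoring idea, assuming the watcher is run periodically and remembers a byte offset between runs. The names `read_new_lines` and `apply_rules` and the rule format (regex, action label) are made up for illustration, and log rotation or truncation is not handled.

```python
import re


def read_new_lines(path, offset):
    """Return (complete lines appended since byte `offset`, new offset)."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Only consume up to the last newline so a half-written line is retried later.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


def apply_rules(lines, rules):
    """Return (action, line) for every rule whose regex matches a line."""
    fired = []
    for line in lines:
        for pattern, action in rules:
            if re.search(pattern, line):
                fired.append((action, line))
    return fired
```

The caller decides what an action label such as "email" actually does, which keeps the matching logic separate from the alerting code.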

Craig
A: 

I think what you need here is too specific to find something already built that would nicely fit your needs. But...

What you described about log4j sounds good to me: once you have the errors logged into the DB, a simple web app would let you look at them, filter them, and set up patterns that fire emails, such as errors from a specific app, an error level threshold, a message matching some regex, etc.
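Since your code is mostly Python, the same idea can be sketched with the standard `logging` module instead of log4j. This is only an illustration: the `SQLiteHandler` class, the `events` table, and its columns are assumptions, SQLite stands in for whatever database you would really use, and the handler as written is meant for single-threaded use.

```python
import logging
import sqlite3


class SQLiteHandler(logging.Handler):
    """Hypothetical handler that stores each log record as a row in SQLite."""

    def __init__(self, db_path):
        super().__init__()
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, created REAL, app TEXT, "
            "level INTEGER, level_name TEXT, message TEXT)"
        )
        self.conn.commit()

    def emit(self, record):
        try:
            self.conn.execute(
                "INSERT INTO events (created, app, level, level_name, message) "
                "VALUES (?, ?, ?, ?, ?)",
                (record.created, record.name, record.levelno,
                 record.levelname, record.getMessage()),
            )
            self.conn.commit()
        except Exception:
            # Never let a logging failure bring down the long-running job.
            self.handleError(record)
```

Each process would attach the handler to its own named logger, for example `logging.getLogger("crawler").addHandler(SQLiteHandler("events.db"))`, so the logger name doubles as the app name when filtering.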

Also, you'll need a small cron job that connects to the DB, searches for new records (since the last time it checked) matching the email criteria, and sends them out.
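A sketch of what that cron job's query step might look like, assuming an `events` table with `id`, `app`, `level` and `message` columns (a hypothetical schema) and numeric levels like Python's, where 40 means ERROR. It tracks the last seen row id rather than a timestamp, which is a substitution on my part to avoid missing rows that share a timestamp; `find_new_alerts` is an illustrative name.

```python
import re
import sqlite3


def find_new_alerts(conn, last_id, min_level=40, pattern=None):
    """Return (rows worth emailing, new high-water mark for the next run)."""
    rows = conn.execute(
        "SELECT id, app, level, message FROM events WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    # Advance past every row we looked at, not just the ones that matched.
    new_last_id = rows[-1][0] if rows else last_id

    regex = re.compile(pattern) if pattern else None
    alerts = [
        row for row in rows
        if row[2] >= min_level and (regex is None or regex.search(row[3]))
    ]
    return alerts, new_last_id
```

The job would persist `new_last_id` somewhere (a file or a small table) between runs and hand the returned rows to whatever sends the email.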

Coding all this shouldn't take more than a few days at worst and, for what it's worth, you will end up with a tool tailored entirely to your needs.

Seb