views: 145
answers: 4

For my current web development project I'm implementing a back-end system that will flag errors and automatically send the administrator an email with details about what occurred. Trapping the error and generating the email with the appropriate error information is pretty straightforward, but a problem arises when one considers certain groups of error types, especially if the site is being visited frequently.

Consider a couple of examples:

  1. An unplanned database outage that prevents all of the scripts on the web server from being able to connect. If it takes, say, 2 minutes (120 seconds) for the database server to come back online, and the web server is receiving unique requests at a rate of 10/second, then in the time it takes the database server to come back online the admin's inbox would be flooded with 1,200 identical emails, all screaming about a failure to connect to the database.
  2. A bug in a script somewhere that managed to sneak past testing, of the variety that completely breaks content generation and occurs only under a specific set of circumstances (say, once every 100 requests). Using the same unique request rate of 10/second, the administrator is going to get the same email every 10 seconds about the same bug until it is fixed.

What are some approaches/strategies I can use to prevent this scenario from occurring? (I am only interested in monitoring errors generated by the scripts; infrastructure issues are beyond the scope of this solution.)

I'm going to assume that I can almost always uniquely identify errors using a digest of some of the values passed to the error handler callback set by set_error_handler.
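Something along these lines, where the delimiter and the choice of hash are arbitrary assumptions:

```php
<?php
// Derive a stable key from the values PHP passes to the callback
// installed with set_error_handler(). Identical errors (same type,
// message, file and line) hash to the same key, so repeat
// occurrences can be recognised later.
function error_digest($errno, $errstr, $errfile, $errline)
{
    return md5($errno . '|' . $errstr . '|' . $errfile . '|' . $errline);
}
```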

The first and probably most obvious solution is recording errors in a database and only sending the email if a reasonable minimum period of time has passed since the error last occurred. This isn't the ideal approach, especially if the database itself is causing the problem. Another solution would be to write files to disk when errors occur and check whether a reasonable minimum period of time has passed since the file was last modified. Is there any mechanism for solving this problem beyond the two methods I have described?
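For reference, here is a rough sketch of the second (file-based) method; the marker directory and the 5-minute threshold are assumptions:

```php
<?php
// One marker file per unique error digest; its mtime records the last
// time an email was sent for that error.
define('MIN_INTERVAL', 300);                     // assumed minimum gap, seconds
define('MARKER_DIR', '/var/run/error_markers');  // hypothetical directory

function should_notify($digest)
{
    $marker = MARKER_DIR . '/' . $digest;
    if (file_exists($marker) && time() - filemtime($marker) < MIN_INTERVAL) {
        return false;   // this error was reported too recently; stay quiet
    }
    touch($marker);     // create the marker file or bump its mtime
    return true;
}
```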

+1  A: 

Have you tried looking into monitoring software like SiteScope?

RedWolves
Not really what I'm looking for, since I'm more interested in content-generation errors than infrastructure. Perhaps an inability to connect to the database was a poor example.
Kevin Loney
+2  A: 

Why not simply allow them all to be sent out, then collect and store them in a database on the recipient end? That way you bypass the possibility of the database being the problem on the server.

Also, a greater advantage in my opinion is that you don't arbitrarily throw out valuable forensic data. Post hoc analysis is very important, and any kind of filtering could make it incredibly difficult or impossible.

Allain Lalonde
+1  A: 

What I did was monitor the error log and send a digest every 5 minutes. I'd like to think it's because of my high-quality code (versus an unpopular app!), but I don't get hassled too much :P I basically read the log file from end to start, parse error messages, and stop when the timestamp < the last time I ran the job, then send a simple email.
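A minimal sketch of that job, assuming a standard PHP error_log timestamp format and hypothetical file locations:

```php
<?php
// Runs from cron every 5 minutes. Assumes log lines that begin with a
// bracketed timestamp, e.g. "[05-Oct-2008 12:34:56] PHP Warning: ...".
$logFile   = '/var/log/php_errors.log';   // assumed location
$stateFile = '/tmp/error_digest.last';    // remembers the last run time

$lastRun = file_exists($stateFile) ? (int) file_get_contents($stateFile) : 0;
$new     = array();

// Walk the log from end to start, stopping at entries older than the last run.
foreach (array_reverse(file($logFile, FILE_IGNORE_NEW_LINES)) as $line) {
    if (preg_match('/^\[([^\]]+)\]\s*(.*)/', $line, $m)) {
        $ts = strtotime($m[1]);
        if ($ts !== false && $ts < $lastRun) {
            break;                        // already reported in a prior digest
        }
        $new[] = $line;
    }
}

if ($new) {
    mail('admin@example.com',
         count($new) . ' new errors in the last 5 minutes',
         implode("\n", array_reverse($new)));  // restore chronological order
}
file_put_contents($stateFile, time());
```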

This works well enough. However, if you use POST a lot, there is only a limited amount of information you can get by correlating the Apache access log with your PHP error log. I remember reading about a module for logging POSTs to a file from within Apache, but I don't remember the specifics.

However, if you're willing to use the error handler to write somewhere, that might be best, as you've got access to much more information: IP address, session ID (and any user information, which might impact settings like pagination or whatever), function arguments (debug_backtrace, or whatever it is)... Write every error, but only send messages when new errors occur, or after an error has been acknowledged (if you care to write such a system).
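A rough sketch of that idea, combining it with the digest from the question; the log path, marker directory, and addresses are all assumptions:

```php
<?php
set_error_handler(function ($errno, $errstr, $errfile, $errline) {
    $digest = md5($errno . '|' . $errstr . '|' . $errfile . '|' . $errline);
    $record = array(
        'time'    => date('c'),
        'digest'  => $digest,
        'message' => $errstr,
        'where'   => $errfile . ':' . $errline,
        'ip'      => isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '-',
        'session' => session_id(),
        'trace'   => debug_backtrace(DEBUG_BACKTRACE_IGNORE_ARGS),
    );

    // Write every error: no forensic data is thrown away.
    file_put_contents('/var/log/app_errors.log',
                      json_encode($record) . "\n", FILE_APPEND | LOCK_EX);

    // Mail only the first occurrence of each digest; an "acknowledge"
    // tool would delete the marker file to re-arm notification.
    $marker = '/var/run/app_errors/' . $digest;
    if (!file_exists($marker)) {
        touch($marker);
        mail('admin@example.com', 'New error: ' . $errstr,
             print_r($record, true));
    }
    return false;  // fall through to PHP's normal error handling
});
```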

Leprechaun
A: 

You should go ahead and generate whatever log files you want. But instead of sending the emails yourself, hook the logs up to a monitoring system like Nagios. Let the monitoring solution decide when to alert the admins, and how often.

Chase Seibert