For my current web development project I'm implementing a back-end system that flags errors and automatically emails the administrator with details about what occurred. Trapping the error and generating the email with the appropriate information is pretty straightforward, but a problem arises with certain kinds of errors, especially if the site is visited frequently.
Consider a couple of examples:
- An unplanned database outage that prevents every script on the web server from connecting. If it takes, say, 2 minutes (120 seconds) for the database server to come back online and the web server is receiving unique requests at a rate of 10/second, the admin's inbox would be flooded with 1200 identical emails in that time, all screaming about a failure to connect to the database.
- A bug in a script that sneaked past testing, completely breaks content generation, and occurs only in a specific set of circumstances (say, once every 100 requests). At the same rate of 10 unique requests/second, the administrator would get the same email every 10 seconds about the same bug until it is fixed.
What approaches/strategies can I use to prevent this scenario from occurring? (I am only interested in monitoring errors generated by the scripts; infrastructure issues are beyond the scope of this solution.)
I am going to assume that I can almost always uniquely identify an error using a digest of some of the values passed to the error handler callback set by set_error_handler.
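To make the digest idea concrete, here is a minimal sketch of what I have in mind. The helper name `error_fingerprint` and the choice of fields (error level, message, file, line) are my own assumptions; messages that embed variable data would still produce distinct digests, so they might need normalising first.

```php
<?php
// Hypothetical helper: builds a digest from the values PHP passes
// to the callback registered with set_error_handler().
function error_fingerprint(int $errno, string $errstr, string $errfile, int $errline): string
{
    // Which fields to include is an assumption. The message is used as-is,
    // so messages containing variable data yield different digests.
    return sha1(implode('|', [$errno, $errstr, $errfile, $errline]));
}

set_error_handler(function (int $errno, string $errstr, string $errfile = '', int $errline = 0): bool {
    $fingerprint = error_fingerprint($errno, $errstr, $errfile, $errline);
    // ... decide here whether an email should be sent for $fingerprint ...
    return false; // let PHP's normal error handling continue as well
});
```

Two occurrences of the same error at the same file and line would then map to the same 40-character hex string, which is what any throttling mechanism would key on.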
The first and probably most obvious solution is to record each error in a database and only send the email if a reasonable minimum period of time has passed since that error last occurred. This isn't ideal, especially if the database itself is causing the problem. Another solution would be to write a file to disk when an error occurs and check whether a reasonable minimum period of time has passed since the file was last modified. Is there any mechanism to solve this problem beyond the two methods I have described?
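For reference, the file-based variant I am describing would look roughly like the sketch below. The function name `should_notify`, the `.marker` suffix, and the `$now` parameter (there only to make it testable) are hypothetical, and it assumes the fingerprint is a hex digest that is safe to use as a file name.

```php
<?php
// Hypothetical helper: returns true if an email should be sent for this
// error fingerprint, based on the mtime of a per-error marker file.
function should_notify(string $fingerprint, string $dir, int $minInterval, ?int $now = null): bool
{
    $now = $now ?? time();
    $marker = $dir . '/' . $fingerprint . '.marker';

    // filemtime() results are cached per request, so clear the cache first.
    clearstatcache(true, $marker);
    if (is_file($marker) && ($now - filemtime($marker)) < $minInterval) {
        return false; // reported recently: suppress this email
    }

    // touch() creates the marker if it is missing and sets its mtime.
    touch($marker, $now);
    return true;
}
```

With a `$minInterval` of 300, the database outage example would produce one email instead of 1200. The check and the `touch()` are not atomic, so a few concurrent requests could still each send an email, and suppressed occurrences are not counted anywhere.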