Hi everybody,

We have a web server that we're about to launch a number of applications on. They will all share database and memcached servers, but each application has its own MySQL database, and all memcached keys are prefixed per application.

Possible scenario:

If a memcached server in our cluster goes boom, we want someone (the system admin) to be contacted automatically by email, iPhone push notification, or any other appropriate channel.

If we were to install 150 identical applications for our customers on our servers and a memcached server died, all 150 applications would individually discover this and contact our system admin, who would most certainly start thinking about a new job where he or she isn't woken up by 150 messages at 4:15 in the morning.

Possible solution:

One idea is to set up an external error-handling server that accepts a POST request (e.g. sent with cURL) and stores the error message according to its severity. On receiving an error report, it would check whether the same memcached server has already been reported as offline; if so, there would be no need to spam the system admin with additional reminders.
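To make the deduplication idea concrete, here is a minimal sketch of the check such an error-handling server could run before notifying anyone. The class name `AlertDeduplicator`, its methods, and the one-hour cooldown are all hypothetical choices for illustration; it keeps state in memory only, whereas a real server would need persistent storage.

```python
import time


class AlertDeduplicator:
    """Hypothetical helper: decides whether a report should trigger a notification."""

    def __init__(self, cooldown_seconds=3600.0):
        # Assumed policy: at most one notification per resource per cooldown window.
        self.cooldown_seconds = cooldown_seconds
        self._last_notified = {}  # resource name -> timestamp of last notification

    def report(self, resource, now=None):
        """Record an error report; return True only if the admin should be notified."""
        now = time.time() if now is None else now
        last = self._last_notified.get(resource)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # already reported recently, stay quiet
        self._last_notified[resource] = now
        return True

    def resolve(self, resource):
        """Forget a resource once it is back online, so a new outage alerts again."""
        self._last_notified.pop(resource, None)
```

With this in place, 150 applications reporting the same dead memcached server within the cooldown window would produce a single notification, and calling `resolve` after recovery re-arms the alert.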

The questions:

  • What's a good approach to handling errors like this?
  • How do the big players in the industry handle this?

Thanks!

+1  A: 

You might consider using an open source monitoring framework such as Hyperic so you don't need to reinvent the wheel.

Hyperic can monitor many aspects of your system out of the box, and it's pretty easy to plug in new monitoring points. It provides rule-based alerting, and you can configure which types of alerts fire once only until reset and which fire each time the condition occurs.

I have not used it to monitor a PHP app (though I presume it can), but I have used it very successfully to monitor a Java app and its associated MySQL DB.

Eric J.
Hyperic looks super sweet. Will definitely check it out!
Industrial
Thanks for the tip, Eric. Hyperic is up and running now :) Thanks a lot!
Industrial
@Industrial: Glad it worked for you. We have been using it in production for around 18 months and are quite happy with it.
Eric J.
+1  A: 

Well, I think your problem is best solved outside of the application.

You want to monitor physical and software servers/services. I'd recommend something like http://www.nagios.org/ or http://www.opennms.org/. Set it up to watch each memcached server, MySQL server, Apache, etc., and send notifications on state change (down, low resources, recovery, etc.).
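The "notify on state change" idea can be illustrated with a small sketch. This is not how Nagios or OpenNMS are configured; it only shows the principle that a single external watcher alerts when a service's state flips rather than on every failed check. The names `is_reachable` and `StateChangeTracker` are hypothetical, and a plain TCP connect is assumed to be a good enough health check.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class StateChangeTracker:
    """Hypothetical helper: reports an event only when a service's state changes."""

    def __init__(self):
        self._last_state = {}  # service name -> True (up) / False (down)

    def update(self, service, is_up):
        """Store the latest check result; return 'down', 'recovery', or None."""
        previous = self._last_state.get(service)
        self._last_state[service] = is_up
        if previous is None:
            # First observation: only worth an alert if the service is already down.
            return None if is_up else "down"
        if previous and not is_up:
            return "down"
        if not previous and is_up:
            return "recovery"
        return None  # no change, no notification
```

A watcher would call something like `tracker.update("memcached-1", is_reachable("10.0.0.5", 11211))` on a schedule (the address is made up; 11211 is memcached's default port) and send a message only when the result is not `None`.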

ircmaxell