I have a web application, written in ColdFusion, which periodically starts using 100% of the server's CPU and crashes the ColdFusion service.

Since I have been unable to reproduce the problem myself, I'd like to find a utility which will notify me by email when the CPU usage begins to climb, so I can hop on the server, look at FusionReactor to see what's going on, and identify the misbehaving code.

I have Googled and have been unable to find a suitable utility, so I thought I'd ask whether other programmers have found a tool which can do this kind of monitoring. Given the specifics of my needs, I'd prefer not to write my own monitoring tool.

If you have other suggestions for approaching the overall problem, I'd love to hear those also.

Responses to answers:

Using Windows perfmon to trigger a command line sounds promising. Can anyone point me to a tutorial on how to do that?

We don't have a monitoring system that is set up to receive SNMP traps.

We're running ColdFusion 8 Standard Edition, which doesn't include the ColdFusion performance monitoring tools; only Enterprise includes those.

Peter Boughton's answer will probably enable me to solve the problem, but it doesn't help me address the problem proactively as CPU starts to spike, so I'm still looking for a notification solution.

+2  A: 

On Windows, you can use the performance and monitoring tools (perfmon) that come standard with Windows. When the CPU reaches a certain point, it can trigger an SNMP notice which can be picked up by most system monitoring tools and alert you that way. It can also run a command or do a variety of other useful things that might help you nail it down.
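For the "run a command" route, the command could be a small script that emails you. Here is a minimal sketch in Python, assuming the alert is configured to run a program and pass some detail text on the command line. The host name, addresses, and the unauthenticated port-25 relay are all placeholders and assumptions, not anything taken from the original setup.

```python
import smtplib
import sys
from email.message import EmailMessage

# Placeholder values (assumptions); substitute your own mail settings.
SMTP_HOST = "mail.example.com"
SENDER = "cpu-alert@example.com"
RECIPIENT = "you@example.com"


def build_alert(detail: str) -> EmailMessage:
    """Build the notification email; `detail` is whatever text the alert passes in."""
    msg = EmailMessage()
    msg["Subject"] = "CPU alert on ColdFusion server"
    msg["From"] = SENDER
    msg["To"] = RECIPIENT
    msg.set_content(f"A CPU alert fired.\n\nDetail: {detail}\n")
    return msg


def send_alert(msg: EmailMessage) -> None:
    """Send via an SMTP relay assumed to accept unauthenticated mail on port 25."""
    with smtplib.SMTP(SMTP_HOST, 25, timeout=30) as smtp:
        smtp.send_message(msg)


# Only send when the caller supplied some detail on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    try:
        send_alert(build_alert(" ".join(sys.argv[1:])))
    except OSError as exc:  # covers connection, DNS, and SMTP errors
        print(f"could not send alert: {exc}", file=sys.stderr)
```

Saved under a hypothetical name such as cpu_alert.py, it would be invoked as `python cpu_alert.py "CPU above 90%"`.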

Can you give more details on having perfmon run a command? I think that would meet my needs.
davidcl
+1  A: 

Another alternative is available to you if you are running ColdFusion 8: the performance monitor. You can set up alerts to send you an email, call another CFC, and so on, for different criteria.

I use the server monitor a lot. It's a great tool, and although I haven't used the alerts much myself, they are probably well worth looking into as they look easy to set up.

As for overall approaches...

In our company we use Windows Perfmon from one server to poll all our CF servers every 30 seconds to get some metrics such as total CPU and average response time. We log this to a CSV file. Every few minutes a scheduled task runs which reads the detail of these files and saves them to a DB table. It then truncates the files so they don't get too big.
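The "scheduled task reads the CSV and saves it to a DB table" step could look roughly like this. It is a sketch only, not our actual job: it assumes the first CSV row is a header, the first column is the sample timestamp, and the remaining columns are counters. SQLite and the `metrics` table name are stand-ins for whatever database you use.

```python
import csv
import sqlite3


def load_metrics(csv_path: str, conn: sqlite3.Connection) -> int:
    """Copy every numeric sample in the CSV into `metrics`, then reset the file."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics "
        "(sampled_at TEXT, counter TEXT, value REAL)"
    )
    rows = []
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:
            return 0  # empty file, nothing to load
        for record in reader:
            if not record:
                continue
            sampled_at = record[0]
            for counter, raw in zip(header[1:], record[1:]):
                try:
                    rows.append((sampled_at, counter, float(raw)))
                except ValueError:
                    continue  # blank or non-numeric sample

    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)", rows)

    # Truncate only after the commit, keeping the header row so the
    # next run can still map columns to counter names.
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerow(header)
    return len(rows)
```

Truncating a file that the logger still has open may not be safe on every setup, so treat that last step as an assumption to check against your own logging configuration.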

Finally, once a day, a report is sent to our tech department that graphs the CPU and average response time for all our servers.

We find this is a great way to keep server performance in the developer consciousness and spot trends (such as poor CPU utilisation) early. We have found it very effective because you can't know if you are performing badly until you start to measure your metrics.

Ciaran Archer
+1  A: 

You don't necessarily need this notification as it happens - FusionReactor has log files, so you can check these after it has happened, and identify the scripts running at the time.

And if you're not sure when it is happening, I'm fairly sure one of the logs contains memory and CPU usage. I can't recall what the names are right now, but have a poke about in the logs and you should find the relevant info.

Peter Boughton
Good point. I may need to expand FR's logging a bit to make sure the request logs are retained long enough to correlate them to the resource logs, but that's an excellent point.
davidcl
+3  A: 

I had a similar problem a few weeks ago and was directed to a program called AlwaysUp.

http://www.coretechnologies.com/products/AlwaysUp/

Monitors any process by any combination of:

  • Memory usage
  • CPU usage
  • Unresponsiveness
  • A custom script that determines whether it should be reset

From here you can either restart the service, or send notification emails, or run a script to log things, etc.
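To illustrate the general idea of a CPU-usage trigger (this is not how AlwaysUp itself is implemented, just a sketch of the concept), a check that fires only on sustained high CPU rather than a single spike might look like this. The class name, threshold, and window size are hypothetical.

```python
from collections import deque
from typing import Deque


class SustainedHighCpu:
    """True once CPU has been at or above `threshold` for `window` samples in a row."""

    def __init__(self, threshold: float = 90.0, window: int = 6) -> None:
        self.threshold = threshold
        self.window = window
        # Keeps only the most recent `window` samples.
        self._recent: Deque[float] = deque(maxlen=window)

    def add_sample(self, cpu_percent: float) -> bool:
        """Record one reading and report whether the alert condition holds."""
        self._recent.append(cpu_percent)
        return (
            len(self._recent) == self.window
            and all(s >= self.threshold for s in self._recent)
        )
```

Feeding it one reading per polling interval means a brief spike is ignored, while a climb that stays high for the whole window triggers the restart, email, or logging script.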

They have a 30-day demo; I was sold on the second day. It's a good quick fix while you get to the bottom of the issue.

Jas Panesar
Thanks, that sounds like what I'm looking for to solve this persistent but occasional problem.
davidcl
No problem. I hate hack jobs, but this solved a problem for me too. If it ever turns into a bigger problem, I'll give it the attention it is due.
Jas Panesar
A: 

The information you're looking for is in the resource logs, with resource-0.log being the most recent. Resources are polled on a 5-second interval, although this is configurable in the interface.

The help files list what the fields are in the log file. You can also view the help online.

David