Hello everyone, I have a few processes that have to be run at high priority (chrt 98) and that will occasionally hard-lock and peg one core at 100% (not a huge deal). More importantly, a locked process will use all the IO on the system, so much that it's impossible to log into the machine via ssh to kill it, or to perform any task on the machine that isn't already loaded into RAM. If I happen to have something like htop already running, I am able to end the process fine. Is there any utility or way to monitor for this type of runaway process and kill anything that uses 100% of system IO for more than X amount of time? Thanks!

A: 

Can't you start the program with nice (and with a lower priority)? This way at least you should be able to ssh into the box and kill it easily.
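For illustration, here is a minimal sketch of a launcher that starts a command at reduced CPU and I/O priority. It assumes the standard `nice` and util-linux `ionice` binaries are on the PATH; the helper name `low_priority_argv` and the default values are hypothetical. Note that nice values do not affect processes running under a real-time policy set with chrt, and the idle I/O class is only honored by I/O schedulers that support I/O priorities (CFQ or BFQ, for example).

```ruby
# Build an argv that runs `command` (an array of strings) with a higher
# nice value (lower CPU priority) and in the "idle" I/O class.
# Assumption: `nice` and `ionice` are installed and on the PATH.
def low_priority_argv(command, niceness: 10, io_class: 3)
  # ionice class 3 is "idle": the process only gets disk time when
  # nothing else is asking for it.
  ["nice", "-n", niceness.to_s, "ionice", "-c", io_class.to_s, *command]
end

# When run as a script, replace this process with the wrapped command.
if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  exec(*low_priority_argv(ARGV))
end
```

Saved under a name of your choosing, it would be invoked with the command to wrap as its arguments.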

The better solution would of course be to fix the behaviour of the offending process (details needed).

This serverfault thread also seems to contain what you ask for specifically.

ChristopheD
Thank you for the reply! Unfortunately there's no way for me to fix the behavior, as the cause of the crashes changes with every update or random plugin that's installed. The thread you linked was actually one I started before I realized the problem was IO related. Monitoring CPU usage alone probably isn't going to work out, because monit isn't accurate enough to determine the exact CPU usage on a multi-core machine. I figure ionice may help a bit, but I was hoping there was some file system usage monitoring tool I was unaware of that could help with this problem.
bleomycin
A: 

Assuming that it's disk IO that the app is consuming, can you just move the filesystems it's accessing onto separate disks? That way you'll have IO to spare on the disks which the OS is installed on, and should be able to log in and manage (i.e. kill!) the process.

Chris May
A: 

As another poster said, running your process with nice is the way to go, but you did mention that you want to run it at a high priority, which is odd... be aware that if you're running a process at the highest priority and it's pegged, your monitoring system might not even be able to kill it, unless your monitor is at a higher priority still. Anyway....

god, as well as several other process management tools, can easily kill a process if it's misbehaving in any of several ways. The config looks like this: you set checks at a particular interval, and then you can say "after five checks, nuke it if it's been above 98% CPU usage consistently":

  restart.condition(:cpu_usage) do |c|
    c.above = 98.percent
    c.times = 5
  end
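Since the original problem is IO rather than CPU, the same "N consecutive checks" idea can be applied to the per-process counters Linux exposes in /proc/<pid>/io. The following is only a sketch under stated assumptions: it is not part of god, the names `io_bytes`, `StrikeCounter`, and `watch` are hypothetical, the thresholds are made up, and it measures one process's bytes per second rather than "percent of system IO". Reading /proc/<pid>/io requires running as the same user or as root, and as noted above, the watchdog itself has to get scheduled for any of this to help.

```ruby
# Parse the text of /proc/<pid>/io and return read_bytes + write_bytes,
# i.e. the bytes this process actually caused to hit storage.
def io_bytes(proc_io_text)
  fields = proc_io_text.scan(/^(\w+):\s+(\d+)$/).to_h
  fields.fetch("read_bytes", "0").to_i + fields.fetch("write_bytes", "0").to_i
end

# Counts consecutive samples above a limit, similar in spirit to
# the `times` setting in the god condition.
class StrikeCounter
  def initialize(limit_bytes_per_sec, max_strikes)
    @limit = limit_bytes_per_sec
    @max_strikes = max_strikes
    @strikes = 0
  end

  # Record one sample. Returns true once the limit has been exceeded
  # max_strikes times in a row; any sample under the limit resets the count.
  def record(bytes_per_sec)
    @strikes = bytes_per_sec > @limit ? @strikes + 1 : 0
    @strikes >= @max_strikes
  end
end

# Poll a pid every `interval` seconds and SIGKILL it after too many
# consecutive over-limit samples. All thresholds are illustrative.
def watch(pid, limit_bytes_per_sec:, max_strikes:, interval: 5)
  counter = StrikeCounter.new(limit_bytes_per_sec, max_strikes)
  previous = io_bytes(File.read("/proc/#{pid}/io"))
  loop do
    sleep interval
    current = io_bytes(File.read("/proc/#{pid}/io"))
    rate = (current - previous) / interval
    previous = current
    if counter.record(rate)
      Process.kill("KILL", pid)
      break
    end
  end
rescue Errno::ENOENT, Errno::ESRCH
  # The process exited on its own; nothing left to do.
end

if __FILE__ == $PROGRAM_NAME && ARGV[0]
  # Hypothetical limits: kill after 6 samples in a row above 50 MiB/s.
  watch(Integer(ARGV[0]), limit_bytes_per_sec: 50 * 1024 * 1024, max_strikes: 6)
end
```

Keeping the parsing and the strike counting separate from the polling loop means both can be checked without a live process, and the loop only needs one small file read per interval.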

Another, different take that you might have a look at is chpst from the runit system - it allows you to elegantly set bounds on things (but for CPU limiting, nice is still the tool I'd reach for first).

bpo