Hello everyone, I have a few processes that have to run at high priority (chrt 98). Occasionally one of them hard-locks and pegs one core at 100% (not a huge deal), but more importantly it uses all the IO on the system, so much that it's impossible to log into the machine via ssh to kill it, or to perform any task that isn't already loaded into RAM. If I happen to have something like htop already running, I am able to end the process fine. Is there any utility or way to monitor for this type of runaway process and kill anything that uses 100% of system IO for more than X amount of time? Thanks!
Can't you start the program with nice (and with a lower priority)? This way at least you should be able to ssh into the box and kill it easily.
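As a rough illustration of the nice suggestion, here is a minimal Ruby sketch that builds a command line for starting a program under nice(1). The names `NICE_RANGE` and `niced_argv`, and the program name in the usage comment, are made up for this example. Note that niceness only matters for the normal scheduling policy, so it presumably won't help while the process is still started under a real-time priority via chrt.

```ruby
# Valid niceness values on Linux: -20 (most favourable) to 19 (least).
# NICE_RANGE and niced_argv are illustrative names, not from the thread.
NICE_RANGE = (-20..19)

# Return an argv array that runs the given command under nice(1).
# Out-of-range niceness values are clamped into NICE_RANGE.
def niced_argv(argv, niceness = 10)
  raise ArgumentError, "empty command" if argv.empty?

  level = niceness.clamp(NICE_RANGE.min, NICE_RANGE.max)
  ["nice", "-n", level.to_s, *argv]
end

# Usage (hypothetical program name):
#   exec(*niced_argv(["./my_worker", "--some-flag"], 15))
```

Building the argv as an array rather than a shell string avoids quoting problems when it is handed to `exec` or `spawn`.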
The better solution would of course be to fix the behaviour of the offending process (details needed).
This serverfault thread also seems to contain what you ask for specifically.
Assuming that it's disk IO that the app is consuming, can you just move the filesystems it's accessing onto separate disks? That way you'll have IO to spare on the disks which the OS is installed on, and should be able to log in and manage (i.e. kill!) the process.
As another poster said, running your process with nice is the way to go, but you did mention that you want to run it at a high priority, which is odd. Be aware that if you're running a process at the highest priority and it's pegged, your monitoring system might not even be able to kill it unless the monitor runs at a higher priority still. Anyway...
god, as well as several other process management tools, can easily kill a process if it's misbehaving in any of several ways. The config looks like this: you set checks at a particular interval, and then you can say "after five checks, nuke it if it's been above 98% CPU usage consistently":
restart.condition(:cpu_usage) do |c|
  c.above = 98.percent  # trigger when CPU usage is above 98%...
  c.times = 5           # ...on five checks in a row
end
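The "N checks in a row above a threshold" idea is not specific to god or to CPU usage. As a minimal plain-Ruby sketch (not god's actual implementation), the class below counts consecutive samples above a limit; the class name `ConsecutiveThreshold` and the sampling helper mentioned in the usage comment are hypothetical.

```ruby
# Tracks a metric across periodic checks and reports when it has stayed
# above a threshold for a required number of consecutive checks.
# ConsecutiveThreshold is an illustrative name, not part of god.
class ConsecutiveThreshold
  def initialize(above:, times:)
    @above = above    # threshold the sample must exceed
    @times = times    # consecutive checks required before triggering
    @streak = 0       # how many checks in a row were above the threshold
  end

  # Record one sample. Returns true once the streak reaches @times;
  # any sample at or below the threshold resets the streak to zero.
  def observe(value)
    @streak = value > @above ? @streak + 1 : 0
    @streak >= @times
  end
end

# Usage sketch (sample_metric and pid are hypothetical):
#   check = ConsecutiveThreshold.new(above: 98, times: 5)
#   loop do
#     Process.kill("KILL", pid) if check.observe(sample_metric(pid))
#     sleep 10
#   end
```

The same counter could be fed an IO figure instead of CPU usage, provided the watchdog itself can still get scheduled while the runaway process is pegged.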
Another, different take that you might have a look at is chpst from the runit system. It allows you to elegantly set bounds on things (but for CPU limiting, nice is still the tool I'd reach for first).