ansaurus

Question

How do I write a bash script to restart a process if it dies?

Answer 1

+2 A:

The easiest way to do it is to use flock on a lock file. In a Python script you'd do:

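import fcntl, os, sys
lf = open('/tmp/script.lock', 'a')  # append mode: don't wipe a running instance's PID
try:
    fcntl.flock(lf, fcntl.LOCK_EX | fcntl.LOCK_NB)  # raises OSError if already locked
except OSError:
    sys.exit('other instance already running')
lf.truncate(0)  # we own the lock now; replace any stale PID
lf.write('%d\n' % os.getpid())
lf.flush()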

In shell you can actually test whether it's running:

if flock -xn /tmp/script.lock -c true; then
    echo "it's not running"
    restart_process    # placeholder: start the script again here
else
    echo -n "it's already running with PID "
    cat /tmp/script.lock
fi

But of course you don't have to test, because if it's already running and you restart it, the new instance will exit with 'other instance already running'.

When a process dies, all its file descriptors are closed and all its locks are automatically released.
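The same single-instance lock can also be taken from a plain bash script with flock(1) from util-linux. A minimal sketch, assuming bash and util-linux are installed; LOCKFILE and acquire_lock are illustrative names, not part of any tool:

```shell
#!/usr/bin/env bash
# Single-instance guard built on flock(1) from util-linux (assumed installed).
# LOCKFILE and acquire_lock are illustrative names.
LOCKFILE=${LOCKFILE:-/tmp/script.lock}

acquire_lock() {
    # Open fd 9 in append mode so a running instance's PID is not wiped out.
    exec 9>>"$LOCKFILE" || return 1
    # -n: give up immediately instead of blocking if someone else holds the lock.
    flock -n 9 || return 1
    # We hold the lock now; replace the file contents with our PID.
    echo "$$" > "$LOCKFILE"
}

# Usage at the top of a script:
#   acquire_lock || { echo "other instance already running" >&2; exit 1; }
```

Because the lock belongs to the open file, any child process that inherits fd 9 keeps the lock alive even after the script itself has exited; if that matters, close it for long-running children with a redirection such as `some_command 9>&-` (some_command being a placeholder).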

vartec 2009-03-30 11:32:47

That could conceivably simplify it a bit by removing the bash script. What happens if the Python script crashes? Is the file unlocked?

Tom 2009-03-30 11:46:08

The file lock is released as soon as the application stops, whether it's killed, exits normally, or crashes.

Christian Witts 2009-03-30 11:54:47

Answer 2

+3 A:

if [ ! -f "$PIDFILE" ] || ! psgrep "$(cat "$PIDFILE")"; then
    restart_process &
    # $! is the PID of the most recent background job, so restart_process must run with &
    echo $! > "$PIDFILE"
fi
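If psgrep isn't available, `kill -0` is a common way to ask whether a PID is alive without parsing ps: it sends no signal and only reports whether the process exists and may be signalled. A minimal sketch; is_running is a hypothetical helper name, and it is still subject to the PID-recycling problem raised in the comments on this answer:

```shell
#!/usr/bin/env bash
# is_running PIDFILE: succeed if PIDFILE holds the PID of a live process.
# Caveats: kill -0 also fails for processes owned by another user (EPERM),
# and a recycled PID belonging to an unrelated process still counts as "running".
is_running() {
    local pidfile=$1 pid
    [ -f "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    # Reject empty, zero-prefixed (including "0") or non-numeric contents before calling kill.
    case $pid in ''|0*|*[!0-9]*) return 1 ;; esac
    kill -0 "$pid" 2>/dev/null
}

# Usage, mirroring the answer (restart_process is the same placeholder):
#   if ! is_running "$PIDFILE"; then
#       restart_process &
#       echo $! > "$PIDFILE"
#   fi
```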

soulmerge 2009-03-30 11:34:05

Cool, that's fleshing out some of my pseudocode pretty well. Two questions: 1) How do I generate PIDFILE? 2) What's psgrep? It's not on Ubuntu Server.

Tom 2009-03-30 11:43:40

psgrep is just a small app that does the same as `ps ax|grep ...`. You can just install it or write a function for that: psgrep() { ps ax | grep -v grep | grep -q "$1"; }

soulmerge 2009-03-30 11:46:07

Just noticed that I hadn't answered your first question.

soulmerge 2009-03-30 12:12:21

On a really busy server, it's possible that the PID will get recycled before you check.

vartec 2009-03-30 12:20:29

Answer 3

+2 A:

You should use monit, a Unix tool that can monitor different things on the system and react accordingly.

From the docs: http://mmonit.com/monit/documentation/monit.html#pid_testing

check process checkqueue.py with pidfile /var/run/checkqueue.pid
       if changed pid then exec "checkqueue_restart.sh"

You can also configure monit to email you when it performs a restart.

clofresh 2009-03-30 12:19:59

Answer 4

+1 A:

Have a look at monit (http://mmonit.com/monit/). It handles start, stop and restart of your script, and can do health checks plus restarts if necessary.

Or write a simple script:

while true
do
    /your/script
    sleep 1
done

Bernd 2009-03-30 12:39:02

Answer 5

+19 A:

Avoid PID-files, crons, or anything else that tries to evaluate processes that aren't their children.

There is a very good reason why in UNIX, you can ONLY wait on your children. Any method (ps parsing, pgrep, storing a PID, ...) that tries to work around that is flawed and has gaping holes in it. Just say no.

Instead you need the process that monitors your process to be the process' parent. What does this mean? It means only the process that starts your process can reliably wait for it to end. In bash, this is absolutely trivial.

until myserver; do
    echo "Server 'myserver' crashed with exit code $?.  Respawning.." >&2
    sleep 1
done

The above piece of bash code runs myserver in an until loop. The first line starts myserver and waits for it to end. When it ends, until checks its exit status. If the exit status is 0, it means it ended gracefully (which means you asked it to shut down somehow, and it did so successfully). In that case we don't want to restart it (we just asked it to shut down!). If the exit status is not 0, until will run the loop body, which emits an error message on STDERR and restarts the loop (back to line 1) after 1 second.
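To see this behaviour without a real server, the same loop can be wrapped in a function that takes the command to supervise as its arguments. A minimal sketch; supervise and RESPAWN_DELAY are hypothetical names added for illustration:

```shell
#!/usr/bin/env bash
# supervise CMD [ARGS...]: rerun CMD until it exits with status 0.
# RESPAWN_DELAY is an illustrative knob; it defaults to the answer's 1 second.
supervise() {
    # until runs the body only while CMD exits non-zero; inside the body,
    # $? still holds CMD's exit status.
    until "$@"; do
        echo "'$1' crashed with exit code $?.  Respawning..." >&2
        sleep "${RESPAWN_DELAY:-1}"
    done
}
```

Calling `supervise myserver` behaves like the loop above, and `RESPAWN_DELAY=5 supervise myserver` stretches the pause between restarts.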

Why do we wait a second? Because if something's wrong with the startup sequence of myserver and it crashes immediately, you'll have a very intensive loop of constant restarting and crashing on your hands. The sleep 1 takes away the strain from that.

Now all you need to do is start this bash script (asynchronously, probably), and it will monitor myserver and restart it as necessary.

Alternatively, look at inittab(5) and /etc/inittab. You can add a line there to have myserver start at a certain runlevel and be respawned automatically.

Edit.

Let me add some information on why not to use PID files. While they are very popular, they are also very flawed, and there's no reason why you wouldn't just do it the correct way.

Consider this:

1. PID recycling (killing the wrong process):
   - /etc/init.d/foo start: start foo, write foo's PID to /var/run/foo.pid.
   - A while later: foo dies somehow.
   - A while later: any random process that starts (call it bar) takes a random PID; imagine it taking foo's old PID.
   - You notice foo's gone: /etc/init.d/foo restart reads /var/run/foo.pid, checks to see if it's still alive, finds bar, thinks it's foo, kills it, and starts a new foo.
2. PID files go stale. You need over-complicated (or should I say, non-trivial) logic to check whether the PID file is stale, and any such logic is again vulnerable to problem 1.
3. What if you don't even have write access, or are in a read-only environment?
4. It's pointless overcomplication: see how simple my example above is. There's no need to complicate that at all.

By the way, even worse than PID files is parsing ps! Don't ever do this.

- ps is very unportable. While you find it on almost every UNIX system, its arguments vary greatly if you want non-standard output. And standard output is ONLY for human consumption, not for scripted parsing!
- Parsing ps leads to a LOT of false positives. Take the ps aux | grep PID example, and now imagine someone starting a process with a number somewhere as an argument that happens to be the same as the PID you started your daemon with! Imagine two people starting an X session and you grepping for X to kill yours. It's just all kinds of bad.
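The false-positive problem is easy to reproduce without a live system. A small illustration using made-up ps-style lines (the PIDs, names and arguments are invented, not real ps output):

```shell
#!/usr/bin/env bash
# Invented ps-style lines: only the first is the daemon we care about (PID 123).
sample='  123 ?      Ss   0:00 /usr/sbin/mydaemon
 1234 ?      S    0:00 /usr/bin/other
 5678 pts/0  S    0:00 backup --keep 123'

# Naive "is PID 123 running?" check: all three lines match.
printf '%s\n' "$sample" | grep 123

# Matching the PID column exactly narrows it to one line, but this still
# depends on ps's human-oriented output format.
printf '%s\n' "$sample" | awk '$1 == 123'
```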

If you don't want to manage the process yourself, there are some perfectly good systems out there that will act as a monitor for your processes. Look into runit, for example.

lhunath 2009-03-30 12:53:53

You might add some code to send a message or stop the loop if it restarts too many times in a short period of time.

Chas. Owens 2009-03-30 13:40:49
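The suggestion above could look something like the following: the until loop from the answer, plus a crash counter that gives up after too many failures in a short time. A minimal sketch; supervise_limited, its MAX and WINDOW arguments, and RESPAWN_DELAY are hypothetical names, and counting crashes in a fixed window is just one simple policy:

```shell
#!/usr/bin/env bash
# supervise_limited MAX WINDOW CMD [ARGS...]
# Rerun CMD until it exits 0, but stop once it has crashed MAX times
# within a WINDOW-second window (fixed window, reset after it expires).
supervise_limited() {
    local max=$1 window=$2 crashes=0 start status
    shift 2
    start=$SECONDS
    until "$@"; do
        status=$?
        # Start a fresh window (and crash count) once the old one has expired.
        if (( SECONDS - start >= window )); then
            start=$SECONDS
            crashes=0
        fi
        crashes=$((crashes + 1))
        if (( crashes >= max )); then
            echo "'$1' crashed $crashes times within ${window}s (last exit code $status); giving up." >&2
            return "$status"
        fi
        echo "'$1' crashed with exit code $status.  Respawning..." >&2
        sleep "${RESPAWN_DELAY:-1}"
    done
}
```

For example, `supervise_limited 5 60 myserver` stops respawning once myserver has crashed five times within a minute, and returns the last exit code so a caller can alert someone.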

+1 most correct answer. But you are somewhat too pragmatic about pid files... SysV init scripts are based heavily on pid files, mostly because the start and stop states may be in different pgids.

Juliano 2009-03-30 23:02:59

@Chas. Owens: I don't think that's necessary. It would just complicate the implementation for no good reason. Simplicity is always more important, and if it restarts often, the sleep will keep it from having any bad impact on your system resources. There is already a message anyway.

lhunath 2009-03-31 06:22:56

@Juliano: I know PID files are used everywhere. It doesn't mean they're not just as flawed as they were before. Start foo, put its PID in foo.pid. Foo dies. Something else gets started somewhere and takes a random PID, which happens to be the one foo *had*. Stopping foo will kill the wrong process!

lhunath 2009-03-31 06:27:08

Answer 6

A:

I've used the following script with great success on numerous servers:

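pid=$(jps -v | grep "$INSTALLATION" | awk '{print $1}')
echo "$INSTALLATION found at PID $pid"
# An empty $pid would test /proc/ itself, which always exists, so guard against it
while [ -n "$pid" ] && [ -e "/proc/$pid" ]; do sleep 0.1; done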

Notes:

- It's looking for a Java process, so I can use jps; this is much more consistent across distributions than ps.
- $INSTALLATION contains enough of the process path that it's totally unambiguous.
- Use sleep while waiting for the process to die, to avoid hogging resources :)

This script is actually used to shut down a running instance of tomcat, which I want to shut down (and wait for) at the command line, so launching it as a child process simply isn't an option for me.
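Since this waits on a process the script didn't start, a timeout keeps the loop from hanging forever if the process never exits. A minimal sketch, assuming Linux /proc and a sleep that accepts fractions of a second (as the loop above already does); wait_for_exit is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# wait_for_exit PID TIMEOUT: poll /proc (Linux-specific) until PID is gone.
# Returns 0 once it has exited, 1 if it is still present after roughly TIMEOUT
# seconds (bash's SECONDS has whole-second granularity), 2 for a bad PID.
wait_for_exit() {
    local pid=$1 timeout=$2 start=$SECONDS
    # An empty PID would turn the test into [ -e /proc/ ], which is always true.
    case $pid in
        ''|*[!0-9]*) echo "wait_for_exit: bad PID '$pid'" >&2; return 2 ;;
    esac
    while [ -e "/proc/$pid" ]; do
        if (( SECONDS - start >= timeout )); then
            return 1
        fi
        sleep 0.1
    done
    return 0
}
```

For example, once Tomcat has been asked to shut down: `wait_for_exit "$pid" 60 || echo "still running after 60 seconds" >&2`.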

Kevin Wright 2010-05-24 09:47:19
