I've searched around but haven't quite found what I'm looking for. In a nutshell, I have created a bash script that runs in an infinite while loop, sleeping and checking whether a process is running. The only problem is that even if the process is running, the script says it is not and opens another instance.

I know I should check by process name and not process id, since another process could jump in and take the id. However, all perl programs are named Perl5.10.0 on my system, and I intend to have multiple instances of the same perl program open.

The following "if" always returns false, what am I doing wrong here???

while true; do

 if [ ps -p $pid ]; then
  echo "Program running fine"
  sleep 10

 else
  echo "Program being restarted\n"
  perl program_name.pl &
  sleep 5
  read -r pid < "${filename}_pid.txt"
 fi

done
+8  A: 

Get rid of the square brackets. It should be:

if ps -p $pid; then

The square brackets are syntactic sugar for the test command. This is an entirely different beast and does not invoke ps at all:

if test ps -p $pid; then

In fact that yields "-bash: [: -p: binary operator expected" when I run it.
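To make the difference concrete, here is a minimal sketch of how `if` branches on the exit status of whatever command follows it. The helper name `report` is made up for illustration and is not part of the original script; the last two calls assume `ps` is installed.

```shell
#!/usr/bin/env bash
# report is a hypothetical helper: it runs its arguments as a command and
# prints which branch of the if was taken, based only on the exit status.
report() {
    if "$@" > /dev/null 2>&1; then
        echo "success"
    else
        echo "failure"
    fi
}

report true              # a command that exits with status 0
report false             # a command that exits with a non-zero status
report [ -n "text" ]     # [ is just another command; here it tests a non-empty string
report ps -p "$$"        # the form suggested above, checking the shell's own PID
report [ ps -p "$$" ]    # the question's form: [ rejects these arguments and fails
```

The last call never runs `ps`; the words `ps`, `-p` and the PID are handed to `[` as arguments, which is why the original condition was always false.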

John Kugelman
Awesome, I just started shell scripting today and it was driving me nuts. I'll take a simple syntax issue any day over something insanely complex though. Thanks again. PS: You answered so quickly (I posted, went to the bathroom, and came back to a solution, LOVE STACK OVERFLOW!) that I have to wait to mark your answer as correct, but I will in a little while.
@user387049: John Kugelman's answer is good for your shell knowledge. However your script is a poor way of supervising a program; [msw's answer](http://stackoverflow.com/questions/3304559/how-to-use-a-shell-script-to-supervise-a-faulty-program/3304634#3304634) explains why. A much more robust solution would be to use an existing supervisor program, such as the ones mentioned by [Jonas](http://stackoverflow.com/questions/3304559/how-to-use-a-shell-script-to-supervise-a-faulty-program/3306727#3306727).
Gilles
A: 

That's what kill -0 $pid is for. It returns success if a process with pid $pid exists.
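A minimal sketch of that check, using a background `sleep` as a stand-in for the real program. The helper name `is_alive` is hypothetical. Note that `kill -0` also fails when you lack permission to signal the process, so a failure does not always mean the PID is gone.

```shell
#!/usr/bin/env bash
# is_alive is a hypothetical helper name used only for this illustration.
is_alive() {
    # Signal 0 delivers nothing; kill only reports whether the PID exists
    # and whether we are allowed to signal it.
    kill -0 "$1" 2>/dev/null
}

sleep 30 &                        # stand-in for the supervised program
pid=$!

if is_alive "$pid"; then
    echo "process $pid is running"
fi

kill "$pid"
wait "$pid" 2>/dev/null || true   # reap the child so the PID is really gone

if ! is_alive "$pid"; then
    echo "process $pid has exited"
fi
```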

ninjalj
The problem with that command is that you can only check processes you are allowed to signal, which normally means processes owned by the user who runs the script.
Anders
@Anders: If your script is also responsible for restarting the process (as in the example above), then that's a reasonable assumption.
Greg Hewgill
+5  A: 

Aside from the syntax error already pointed out, this is a lousy way to ensure that a process stays alive.

First, you should find out why your program is dying in the first place; this script doesn't fix a bug, it tries to hide one.

Secondly, if it is so important that a program remain running, why do you expect your (at least once already) buggy shell script will do the job? Use a system facility that is specifically designed to restart server processes. If you say what platform you are using and the nature of your server process, I can offer more concrete advice.

added in response to comment:

Sure, there are engineering exigencies, but as the OP noted in the OP, there is still a bug in this attempt at a solution:

I know I should check by process name and not process id, since another process could jump in and take the id.

So now you are left with a PID tracking script, not a process "nanny". Although the chances are small, the script as it now stands has a ten-second window in which:

  1. the "monitored" process fails
  2. I start up my week-long emacs process, which grabs the same PID
  3. the nanny script continues on blissfully unaware that its dependent has failed

The script isn't merely buggy, it is invalid because it presumes that PIDs are stable identifiers of a process. There are ways that this could be better handled even at the shell script level. The simplest is to never detach the execution of perl from the script since the script is doing nothing other than watching the subprocess. For example:

while true ; do
    if perl program_name.pl ; then
         echo "program_name terminated normally, restarting"
    else
         echo "oops program_name died again, restarting"
    fi
done

Which is not only shorter and simpler, but it actually blocks for the condition that you are really interested in: the run-state of the perl program. The original script repeatedly checks a bad proxy indication of the run state condition (the PID) and so can get it wrong. And, since the whole purpose of this nanny script is to handle faults, it would be bad if it were faulty itself by design.

msw
First, one can't always fix the cause of the problem because it's out of one's hands. Secondly, I agree, although a well-designed script can easily perform tasks such as this without being "buggy". But yes, one should always use existing functionality if it exists.
Anders
While fixing the underlying problem would be best, this seemed like an easier and more stable route. The errors I'm getting from Perl are either segmentation faults, inconsistently leaving scope, or editing a shared variable. Now these errors are inconsistent themselves; even if I start the program back up and run the same input file, I may get no errors. Also, my shell script is not buggy anymore; it was just that one issue, so I do fully expect it to work.
@user387049: it is still buggy, see my "added" above.
msw
Yes, I actually had wanted to do this, but like I said before, I am running multiple instances of the same Perl program. Therefore I would be checking perl /directory/program_name.pl and not know which one died and needs to be restarted. I guess a way around that would be to place however many instances I need, say n, into n folders. That way each program would have a different path, correct? I could monitor it all from one daemon, right?
You might not want to reinvent the wheel: http://mmonit.com/monit/ I've not used it myself, but I expect that it does what it says on the "box". It is available at a repository near you.
msw
“PIDs are stable identifiers of a process”: well, that's true as long as the process is running, and in fact a little after, as long as the process is still a *zombie*, i.e., hasn't been *reaped* by its parent: the pid is not recycled until the parent has acknowledged that the child has died. This is detectable in the shell by remembering pids from `$!` and using the `wait` builtin. It's tricky to do right though. It would be a little easier in Perl, though still not as easy as using an existing supervisor.
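A rough sketch of the `$!` and `wait` idea, under stated assumptions: `supervise` is a made-up helper, and it restarts the command a fixed number of times only so that the example terminates; a real nanny would presumably loop forever. Because the script is the parent of each child, the PID cannot be recycled until `wait` has reaped it.

```shell
#!/usr/bin/env bash
# supervise is a hypothetical helper: it runs a command up to max_runs times,
# remembering each child's PID from $! and blocking in wait until it exits.
supervise() {
    local max_runs="$1"
    shift
    local run=1 pid status
    while [ "$run" -le "$max_runs" ]; do
        "$@" &
        pid=$!                       # PID of the child we just started
        status=0
        wait "$pid" || status=$?     # blocks until this child exits, then reaps it
        echo "run $run: pid $pid exited with status $status"
        run=$((run + 1))
    done
}

# Stand-in for the faulty program: a child that exits with status 7 each time.
supervise 3 sh -c 'exit 7'
```

Several instances can be watched this way by running one such loop per instance, each tracking only its own child.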
Gilles
+2  A: 

I totally agree that fiddling with the PID is nearly always a bad idea. The while true ; do ... done script is quite good; however, for production systems there are a couple of process supervisors which do exactly this and much more, e.g.

  • enable you to send signals to the supervised process (without knowing its PID)
  • check how long a service has been up or down
  • capture its output and write it to a log file

Examples of such process supervisors are daemontools or runit. For a more elaborate discussion and examples see Init scripts considered harmful. Don't be disturbed by the title: traditional init scripts suffer from exactly the same problem as you do (they start a daemon, keep its PID in a file and then leave the daemon alone).

Jonas
+1  A: 

I agree that you should find out why your program is dying in the first place. However, an ever-running shell script is probably not a good idea. What if this supervising shell script dies? (And yes, get rid of the square brackets around ps -p $pid. You want the exit status of the ps -p $pid command. The square brackets are a replacement for the test command.)

There are two possible solutions:

  1. Use cron to run your "supervising" shell script to see if the process you're supervising is still running, and if it isn't, restart it. The supervised process can output its PID into a file. Your supervising program can then cat this file and get the PID to check.

  2. If the program you're supervising is providing a service upon a particular port, make it an inetd service. This way, it isn't running at all until there is a request upon that port. If you set it up correctly, it will terminate when not needed and restart when needed. This takes fewer resources, and the OS will handle everything for you.
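The first option could be sketched roughly as follows. Everything named here is an assumption for illustration: the function `check_and_restart`, the temporary PID file, and the `sleep` stand-in for the real program. As a simplification, the checker records `$!` itself instead of having the supervised program write the file. The PID-reuse caveat still applies to any PID-file check.

```shell
#!/usr/bin/env bash
# check_and_restart is a hypothetical helper meant to be run once per cron
# invocation. A crontab entry such as
#   * * * * * /path/to/check.sh
# (the path is a placeholder) would run the check every minute.
check_and_restart() {
    local pid_file="$1"
    shift
    local pid=""
    if [ -r "$pid_file" ]; then
        read -r pid < "$pid_file" || true   # an empty file leaves pid empty
    fi
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "running"
    else
        "$@" > /dev/null 2>&1 &             # restart, detached from our output
        echo "$!" > "$pid_file"             # remember the new PID for next time
        echo "restarted"
    fi
}

# Demonstration with a temporary PID file and sleep as the "program".
pid_file=$(mktemp)
check_and_restart "$pid_file" sleep 30      # nothing recorded yet, so it starts one
check_and_restart "$pid_file" sleep 30      # finds the PID recorded by the first call
kill "$(cat "$pid_file")"
rm -f "$pid_file"
```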

David W.