Hi. A program named G09 runs in parallel using Linda. It spawns its parallel child processes on other nodes (hosts) like this:

/usr/bin/ssh -x compute-0-127.local -n /usr/local/g09l/g09/linda-exe/l1002.exel ...other_opts...

However, when the master node kills this process, the corresponding child process on the other node (here compute-0-127) does not die but keeps running in the background. Right now, I manually log in to each node that has these orphaned Linda processes and kill them with kill. Is there any way to kill such child processes automatically?

See pastebin 1 for the pstree output before the process is killed and pastebin 2 for the pstree output after the parent is killed:
pastebin1 - http://pastebin.com/yNXFR28V
pastebin2 - http:// pastebin.com/ApwXrueh
-not enough reputation points for hyperlinking second pastebin, sorry !(
Update in response to answer 1:
Thanks, Martin, for explaining. I tried the following:

killme() { kill 0 ; } ; #Make calls to prepare for running G09 ; 
g09 < "$g09inp" > "$g09out" &
trap killme 'TERM'
wait

but when Torque/Maui (which handles job execution) kills the job (this script) via qdel $jobid, the processes that G09 started as ssh -x $host -n keep running in the background. What am I doing wrong here? (Normal termination is not a problem, as G09 itself stops those processes.) The following is the pstree output before qdel:

bash
|-461.norma.iitb. /opt/torque/mom_priv/jobs/461.norma.iitb.ac.in.SC
|   `-g09
|       `-l1002.exe 1048576000Pd-C-C-addn-H-MO6-fwd-opt.chk
|           `-cLindaLauncher/tmp/viaExecDataN6
|               |-l1002.exel 1048576000Pd-C-C-addn-H-MO6-fwd-opt.ch
|               |   |-{l1002.exel}
|               |   |-{l1002.exel}
|               |   |-{l1002.exel}
|               |   |-{l1002.exel}
|               |   |-{l1002.exel}
|               |   |-{l1002.exel}
|               |   |-{l1002.exel}
|               |   `-{l1002.exel}
|               |-ssh -x compute-0-149.local -n ...
|               |-ssh -x compute-0-147.local -n ...
|               |-ssh -x compute-0-146.local -n ...
|               |-{cLindaLauncher}
|               `-{cLindaLauncher}
`-pbs_demux

and after qdel it still shows:

461.norma.iitb. /opt/torque/mom_priv/jobs/461.norma.iitb.ac.in.SC
`-ssh -x -n compute-0-149 rm\040-rf\040/state/partition1/trirag09/461

l1002.exel 1048576000Pd-C-C-addn-H-MO6-fwd-opt.ch
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
`-{l1002.exel}

ssh -x compute-0-149.local -n /usr/local/g09l/g09/linda-exe/l1002.exel

ssh -x compute-0-147.local -n /usr/local/g09l/g09/linda-exe/l1002.exel

ssh -x compute-0-146.local -n /usr/local/g09l/g09/linda-exe/l1002.exel

What am I doing wrong here? Is the trap killme 'TERM' wrong?

+1  A: 

I would try the following approach:

  • create a script/application that wraps this g09 binary that you are starting, and start that wrapper instead
  • in the script, wait for the HUP signal to arrive (which should be received when the ssh connection is closed)
  • when handling the HUP signal, send a signal to your own process group (i.e. pass 0 as the PID to kill), which kills all processes in the group.

Sending a KILL signal to the process group is really easy: kill -9 0. Try this:

#!/bin/sh
./b.sh 1 &
./b.sh 2 &
sleep 10
kill -9 0

where b.sh is

#!/bin/sh
while /bin/true
do
  echo $1
  sleep 1
done

You can have as many child processes as you want (directly or indirectly); they will all get the signal - as long as they don't detach themselves from the process group.
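
To tie the three bullet points together, here is a minimal sketch of such a wrapper. It is not a tested solution for G09 or Torque. The `sleep` commands stand in for the real job, the function name `wrapper` is made up for illustration, and `set -m` is used only so that the demonstration gets its own process group and `kill 0` cannot reach the shell you run it from. It assumes bash.

```bash
#!/bin/bash
# Demonstration only: `sleep` stands in for the real job (an assumption).
set -m    # job control: the background wrapper below gets its own process group

wrapper() {
    # On HUP or TERM: ignore TERM in this shell, signal the whole group, reap, exit.
    trap 'trap "" TERM; kill -TERM 0; wait; exit 0' HUP TERM
    sleep 300 &          # placeholder child 1
    sleep 300 &          # placeholder child 2
    wait                 # returns early when a trapped signal arrives
}

wrapper &                # runs in a subshell that leads a new process group
wrapper_pid=$!
sleep 1                  # give the wrapper time to install its trap
kill -HUP "$wrapper_pid" # simulate the hangup from a closed ssh connection
wait "$wrapper_pid"
status=$?

# Signal 0 sends nothing; it only checks whether the group still has members.
if kill -0 -- -"$wrapper_pid" 2>/dev/null; then
    group_state=alive
else
    group_state=gone
fi
echo "wrapper exit status: $status, process group: $group_state"
```

If the group kill worked, the last line reports the process group as gone. The trap is installed before the children are started, so there is no window in which a signal can arrive without a handler in place. In a real wrapper the two `sleep` lines would be replaced by the actual job, and `set -m` plus the simulated `kill -HUP` would be dropped.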

Martin v. Löwis
Thanks, this is very constructive and useful (now I only need to turn it into code; I'm not an expert in bash scripting :( ). Can you please elaborate? We use Torque/Maui (a cluster resource manager), which sends a HUP signal to the script that calls the G09 binary. So I already have a wrapper script that calls G09 (the binary that starts the `ssh -x -n` processes).
Prince
Can someone please elaborate on bullet point 3?
Prince
See my edit of how to kill a process group. You don't have to use SIGKILL, of course.
Martin v. Löwis
Thanks, Martin. I have edited my question to reflect my attempt to implement your solution. However, it did not work; can you please look at the updated question?
Prince
I'm confused. How is the script you wrote related to the pstree output? In particular, what is 461.norma.iitb? If that's the script, it is no surprise that it didn't execute the kill: it is still running, so apparently it didn't receive SIGTERM (which, in turn, might not be surprising, as you said SIGHUP will be sent, not SIGTERM).
Martin v. Löwis
Further comment: put a `touch /tmp/foo` command into the trap function to see whether it gets executed at all.
Martin v. Löwis
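
The last two comments suggest checking which signal, if any, actually reaches the script. A minimal diagnostic sketch of that idea follows, assuming bash. The marker file name and the `on_signal` helper are hypothetical, and the script sends SIGTERM to itself only to simulate what the batch system might do; which signal Torque really delivers is exactly what this is meant to find out.

```bash
#!/bin/bash
# Diagnostic sketch: record which signal reaches this script.
marker="${TMPDIR:-/tmp}/trap-fired.$$"   # hypothetical marker file; /tmp/foo works too
rm -f "$marker"

on_signal() {
    echo "$1" >> "$marker"   # leave evidence that the handler ran, and for which signal
    # the real script would run `kill 0` here; omitted so the demo is harmless
}
trap 'on_signal HUP'  HUP
trap 'on_signal TERM' TERM

kill -TERM "$BASHPID"        # simulate the batch system signalling this script

received=$(cat "$marker" 2>/dev/null)
echo "handler ran for: ${received:-nothing}"

rm -f "$marker"
trap - HUP TERM              # restore default signal handling
```

In the real job script the simulated `kill` line would be removed and the marker file inspected after `qdel`. An empty or missing file would suggest that neither signal reached the script, or that it was killed before the handler could run.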