views:

96

answers:

1

We use a cluster with Perceus (warewulf) software to do some computing. This software package has wwmpirun program (a Perl script) to prepare a hostfile and execute mpirun:

# ...
system("$mpirun -hostfile $tmp_hostfile -np $mpirun_np @ARGV");
# ...

We use this script to run a math program (CODE) on several nodes, and CODE is normally supposed to be stopped by Ctrl+C giving a short menu with options: status, stop, and halt. However, running with MPI, pressing Ctrl+C badly kills CODE with loss of data.

Developers of CODE suggest a workaround - the program can be stopped by creating a file with name stop%s, where %s is name of task-file being executed by CODE. This allows to stop, but we cannot get status of calculation. Sometimes it takes really long time and getting this function back would be very appreciated.

What do you think - the problem is in CODE or mpirun?

Can one find a way to communicate with CODE executed by mpirun?

UPDATE1

In single run, one gets status of calculation by pressing Ctrl+C and choosing option status in the provided menu by entering s. CODE prints status information in STDOUT and continues to do the calculation.

+1  A: 
  1. "we cannot get status of calculation" - what does that mean? do you expect to get the status somehow but are not? or is the software not designed to give you status?

    Your system call doesn't re-direct standard error/out anyplace, is that where the status is supposed to be (in which case, catch it by opening a pipe or re-directing to a log and having the wrapper read the log).

    Also, you're not processing the return code by evaluating the return value of system call - that may be another way the program communicates.

  2. Your Ctrl+C problem might be because Ctrl+C is caught by the Perl wrapper which dies instead of by the CODE which has some nice Ctrl+C interrupt handler. The solution might be to add interrupt handler to mpirun - see Perl Cookbook Recipe 16.18 for $SIG{INT} or http://www.wellho.net/resources/ex.php4?item=p216/sigint ; you may want to have the Perl wrapper catch Ctrl+C and send the INT signal to CODE it launched.

DVK
@DVK, I have updated my question concerning p.1. Yes, I believe I need to have an interrupt handler, but how I can pass it to CODE called via `system` and `mpirun`?
Andrey