We use a cluster running Perceus (Warewulf) for computing. The package includes wwmpirun, a Perl script that prepares a hostfile and executes mpirun:
# ...
system("$mpirun -hostfile $tmp_hostfile -np $mpirun_np @ARGV");
# ...
We use this script to run a math program (CODE) on several nodes. Normally, pressing Ctrl+C interrupts CODE and brings up a short menu with the options status, stop, and halt. When CODE runs under MPI, however, pressing Ctrl+C kills it outright and the data are lost.
The developers of CODE suggest a workaround: the program can be stopped by creating a file named stop%s, where %s is the name of the task file being executed by CODE. This lets us stop the run, but we still cannot get the status of the calculation. Calculations sometimes take a very long time, so getting this feature back would be much appreciated.
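For reference, the stop-file workaround can be scripted roughly as follows. This is only a minimal sketch: the helper name request_stop and the task-file name job1.inp are hypothetical, and it assumes CODE looks for the stop file in its working directory, which I have not confirmed.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: create an empty file named "stop<taskfile>" in $dir.
# Assumption: CODE checks its working directory for this file.
sub request_stop {
    my ($dir, $task_file) = @_;
    my $path = File::Spec->catfile($dir, "stop$task_file");
    open(my $fh, '>', $path) or die "Cannot create $path: $!";
    close($fh) or die "Cannot close $path: $!";
    return $path;
}

# Example (hypothetical task file in the current directory):
# request_stop('.', 'job1.inp');
```

This only covers stopping; it does nothing for the status option.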
What do you think: is the problem in CODE or in mpirun?
Is there a way to communicate with CODE when it is executed by mpirun?
UPDATE1
In a single run, one gets the status of the calculation by pressing Ctrl+C and choosing the status option from the menu by entering s. CODE prints the status information to STDOUT and continues the calculation.