Hi,

I am learning OpenMPI on a cluster. Here is my first example. I expected the output to show responses from different nodes, but they all come from the same node, node062. Why is that, and how can I get reports from different nodes, to confirm that MPI is actually distributing processes across them? Thanks and regards!

ex1.c

/* test of MPI */
#include "mpi.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
  char idstr[32];
  char buff[128];
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int numprocs, myid, i, namelen;
  MPI_Status stat;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Get_processor_name(processor_name, &namelen);

  if (myid == 0)
  {
    printf("WE have %d processors\n", numprocs);
    for (i = 1; i < numprocs; i++)
    {
      sprintf(buff, "Hello %d", i);
      MPI_Send(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD);
    }
    for (i = 1; i < numprocs; i++)
    {
      MPI_Recv(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD, &stat);
      printf("%s\n", buff);
    }
  }
  else
  {
    MPI_Recv(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
    sprintf(idstr, " Processor %d at node %s ", myid, processor_name);
    strcat(buff, idstr);
    strcat(buff, "reporting for duty\n");
    MPI_Send(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}

ex1.pbs

#!/bin/sh
#
# This is an example script example.sh
#
# These commands set up the Grid Environment for your job:
#PBS -N ex1
#PBS -l nodes=10:ppn=1,walltime=1:10:00
#PBS -q dque

# export OMP_NUM_THREADS=4

mpirun -np 10 /home/tim/courses/MPI/examples/ex1

compile and run:

[tim@user1 examples]$ mpicc ./ex1.c -o ex1   
[tim@user1 examples]$ qsub ex1.pbs  
35540.mgt  
[tim@user1 examples]$ nano ex1.o35540  
----------------------------------------  
Begin PBS Prologue Sat Jan 30 21:28:03 EST 2010 1264904883  
Job ID:         35540.mgt  
Username:       tim  
Group:          Brown  
Nodes:          node062 node063 node169 node170 node171 node172 node174 node175  
node176 node177  
End PBS Prologue Sat Jan 30 21:28:03 EST 2010 1264904883  
----------------------------------------  
WE have 10 processors  
Hello 1 Processor 1 at node node062 reporting for duty  
Hello 2 Processor 2 at node node062 reporting for duty        
Hello 3 Processor 3 at node node062 reporting for duty        
Hello 4 Processor 4 at node node062 reporting for duty        
Hello 5 Processor 5 at node node062 reporting for duty        
Hello 6 Processor 6 at node node062 reporting for duty        
Hello 7 Processor 7 at node node062 reporting for duty        
Hello 8 Processor 8 at node node062 reporting for duty        
Hello 9 Processor 9 at node node062 reporting for duty  

----------------------------------------  
Begin PBS Epilogue Sat Jan 30 21:28:11 EST 2010 1264904891  
Job ID:         35540.mgt  
Username:       tim  
Group:          Brown  
Job Name:       ex1  
Session:        15533  
Limits:         neednodes=10:ppn=1,nodes=10:ppn=1,walltime=01:10:00  
Resources:      cput=00:00:00,mem=420kb,vmem=8216kb,walltime=00:00:03  
Queue:          dque  
Account:  
Nodes:  node062 node063 node169 node170 node171 node172 node174 node175 node176  
node177  
Killing leftovers...  

End PBS Epilogue Sat Jan 30 21:28:11 EST 2010 1264904891  
----------------------------------------

UPDATE:

I would like to run several background jobs in a single PBS script, so that the jobs run at the same time. For example, in the script above I added a second run of ex1 and put both runs in the background in ex1.pbs:

#!/bin/sh
#
# This is an example script example.sh
#
# These commands set up the Grid Environment for your job:
#PBS -N ex1
#PBS -l nodes=10:ppn=1,walltime=1:10:00
#PBS -q dque

echo "The first job starts!"
mpirun -np 5 --machinefile /home/tim/courses/MPI/examples/machinefile /home/tim/courses/MPI/examples/ex1 &
echo "The first job ends!"
echo "The second job starts!"
mpirun -np 5 --machinefile /home/tim/courses/MPI/examples/machinefile /home/tim/courses/MPI/examples/ex1 &
echo "The second job ends!"

(1) The result looks fine after qsub-ing this script with the previously compiled executable ex1:

The first job starts!  
The first job ends!  
The second job starts!  
The second job ends!  
WE have 5 processors  
WE have 5 processors  
Hello 1 Processor 1 at node node063 reporting for duty        
Hello 2 Processor 2 at node node169 reporting for duty        
Hello 3 Processor 3 at node node170 reporting for duty        
Hello 1 Processor 1 at node node063 reporting for duty        
Hello 4 Processor 4 at node node171 reporting for duty        
Hello 2 Processor 2 at node node169 reporting for duty        
Hello 3 Processor 3 at node node170 reporting for duty        
Hello 4 Processor 4 at node node171 reporting for duty  

(2) However, ex1 runs so quickly that the two background jobs probably do not overlap much, which is not the case in my real project. So I added sleep(30) to ex1.c to extend its running time, so that the two background runs of ex1 overlap almost the whole time.

/* test of MPI */
#include "mpi.h"
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  char idstr[32];
  char buff[128];
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int numprocs, myid, i, namelen;
  MPI_Status stat;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Get_processor_name(processor_name, &namelen);

  if (myid == 0)
  {
    printf("WE have %d processors\n", numprocs);
    for (i = 1; i < numprocs; i++)
    {
      sprintf(buff, "Hello %d", i);
      MPI_Send(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD);
    }
    for (i = 1; i < numprocs; i++)
    {
      MPI_Recv(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD, &stat);
      printf("%s\n", buff);
    }
  }
  else
  {
    MPI_Recv(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
    sprintf(idstr, " Processor %d at node %s ", myid, processor_name);
    strcat(buff, idstr);
    strcat(buff, "reporting for duty\n");
    MPI_Send(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
  }

  sleep(30); /* newly added to extend the running time */
  MPI_Finalize();
  return 0;
}

But after recompiling and qsub-ing again, the results do not look right: some processes are aborted. In ex1.o35571:

The first job starts!  
The first job ends!  
The second job starts!  
The second job ends!  
WE have 5 processors  
WE have 5 processors  
Hello 1 Processor 1 at node node063 reporting for duty  
Hello 2 Processor 2 at node node169 reporting for duty  
Hello 3 Processor 3 at node node170 reporting for duty  
Hello 4 Processor 4 at node node171 reporting for duty  
Hello 1 Processor 1 at node node063 reporting for duty  
Hello 2 Processor 2 at node node169 reporting for duty  
Hello 3 Processor 3 at node node170 reporting for duty  
Hello 4 Processor 4 at node node171 reporting for duty  
4 additional processes aborted (not shown)  
4 additional processes aborted (not shown)  

in ex1.e35571:

mpirun: killing job...  
mpirun noticed that job rank 0 with PID 25376 on node node062 exited on signal 15 (Terminated).  
mpirun: killing job...  
mpirun noticed that job rank 0 with PID 25377 on node node062 exited on signal 15 (Terminated).  

I wonder why these processes are aborted. How can I run background jobs correctly from a PBS script?

+2  A: 

hi

A couple of things: you need to tell MPI where to launch processes. Assuming you are using MPICH, look at the mpiexec help and find the machine-file option (or its equivalent). Unless a machine file is provided, it will run everything on one host.

PBS automatically creates a nodes file. Its name is stored in the PBS_NODEFILE environment variable, which is available in the PBS command file. Try the following:

mpiexec -machinefile $PBS_NODEFILE ...

If you are using MPICH2, you have to boot your MPI runtime using mpdboot. I do not remember the details of the command; you will have to read the man page. Remember to create the secret file, otherwise mpdboot will fail.

I read your post again: you are using Open MPI, so you still have to supply a machines file to the mpiexec command, but you do not have to mess with mpdboot.
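Concretely, the mpirun line in ex1.pbs could be changed to use the PBS-generated node file. A sketch (the paths are the ones from the original script; this requires a PBS cluster with Open MPI to actually run):

```shell
#!/bin/sh
#PBS -N ex1
#PBS -l nodes=10:ppn=1,walltime=1:10:00
#PBS -q dque

# PBS_NODEFILE is set by PBS for each job and lists one hostname
# per allocated slot, so mpirun spreads ranks across the nodes
mpirun -np 10 --machinefile "$PBS_NODEFILE" /home/tim/courses/MPI/examples/ex1
```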

aaa
Thank you so much! That solved my problem. If someone else has reserved some nodes, or some nodes are currently running other jobs, will mpiexec notice that when given $PBS_NODEFILE as the machine file? Also, could you try to answer my questions about using PBS on a cluster, posted at http://superuser.com/questions/102812/torch-in-cluster ? Thanks in advance!
Tim
+1  A: 

Hi

As a diagnostic, try inserting these statements immediately after your call to MPI_Get_processor_name:

printf("Hello, world.  I am %d of %d on %s\n", myid, numprocs, processor_name);
fflush(stdout); 

If all processes report the same node name, that would suggest to me that you don't quite understand what is going on with the job management system and the cluster; perhaps PBS is (despite you apparently telling it otherwise) putting all 10 processes on one node. Do you have 10 cores in a node?

If this produces different results, that suggests to me that something is wrong with your code, though it looks OK to me.

Regards

Mark

High Performance Mark
Thanks Mark. I am not familiar with PBS either and am learning it now too. If possible, could you try to answer my questions about PBS/Torque at http://superuser.com/questions/102812/torch-in-cluster ?
Tim
Sorry Tim, I'm a Grid Engine user, with only the vaguest knowledge of PBS.
High Performance Mark
+1  A: 

hi

By default PBS (I am assuming Torque) allocates nodes in exclusive mode, so there is only one job per node. It is a bit different if the nodes have multiple processors: most likely one process per CPU. PBS can be changed to allocate nodes in time-sharing mode; look at the man page of qmgr. Long story short, you will most likely not have overlapping nodes in the node file, since the node file is created when resources become available rather than at submission time.

The purpose of PBS is resource control, most commonly time and (automatic) node allocation.

Commands in a PBS file are executed sequentially. You can put processes in the background, but that might defeat the purpose of resource allocation; I do not know your exact workflow, though. I have used background processes in PBS scripts to copy data before the main program runs in parallel, using &. A PBS script is actually just a shell script.
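Since a PBS script is just a shell script, a script that backgrounds its jobs should normally `wait` for them before exiting; otherwise the script finishes immediately while the background processes are still running. A minimal sketch (with `sleep` standing in for the real mpirun invocations):

```shell
# launch two stand-in background jobs; in a real PBS script these
# would be the mpirun invocations
sleep 1 &
pid1=$!
sleep 1 &
pid2=$!

# block until both background jobs have finished before the script exits
wait "$pid1" "$pid2"
echo "both background jobs finished"
```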

You can assume that PBS does not know anything about the inner workings of your script. You can certainly run multiple processes/threads via the script; if you do so, it is up to you and your operating system to allocate cores/processors in a balanced fashion. If you have a multithreaded program, the most likely approach is to run one MPI process per node and then spawn OpenMP threads.

Let me know if you need clarifications

aaa
Thanks again! In my real project, I would like to run an executable several times, all in the background, in a single PBS script. However, when I tried this on the simple example given above, some processes were aborted. Please see my update to the original post. How do I run background jobs in a PBS script? Thanks in advance!
Tim
@Tim hard to tell; it looks like the MPI job is being terminated externally. How much time do you allocate in the PBS script? That is my first guess.
aaa
walltime is set to 1:10:00. I think that is long enough for a run with sleep(30). Do you see anything wrong with how the background jobs are specified in the PBS script?
Tim
@Tim Tim, we probably have to assume the first MPI launch is being killed by the second. I know less about Open MPI than about MPICH, but perhaps the default in Open MPI is to launch only a single MPI instance per node. Here is what you can try: split the node file so that the two MPI launches do not share any nodes. If that works, look through the Open MPI manual to see how to run multiple jobs on the same machine.
aaa
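One way to try the suggestion above is to derive two disjoint machine files from the PBS node file before launching. A sketch (on the cluster the input would be "$PBS_NODEFILE"; a mock list of the hostnames from the job output is used here for illustration):

```shell
# On the cluster this would be "$PBS_NODEFILE"; write a mock node
# list so the splitting logic can be shown standalone.
printf 'node062\nnode063\nnode169\nnode170\nnode171\nnode172\nnode174\nnode175\nnode176\nnode177\n' > nodefile

# split the node file into two disjoint halves, one per mpirun launch
half=$(( $(wc -l < nodefile) / 2 ))
head -n "$half" nodefile > machinefile.a            # first half of the nodes
tail -n +"$(( half + 1 ))" nodefile > machinefile.b # second half of the nodes

# each mpirun then gets its own machine file, e.g.:
#   mpirun -np 5 --machinefile machinefile.a ./ex1 &
#   mpirun -np 5 --machinefile machinefile.b ./ex1 &
#   wait
```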