When attempting to run the first example in the boost::mpi tutorial, I was unable to run it across more than two machines. Specifically, this seemed to run fine:
mpirun -hostfile hostnames -np 4 boost1
with each line in hostnames of the form <node_name> slots=2 max_slots=2.
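In other words, the hostfile contains one line per machine, something like this (the node names stand in for my actual hosts):

node1 slots=2 max_slots=2
node2 slots=2 max_slots=2
node3 slots=2 max_slots=2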
But when I increase the number of processes to 5, it just hangs. I have decreased slots/max_slots to 1 with the same result whenever I use more than two machines. On the nodes, this shows up in the job list:
<user> Ss orted --daemonize -mca ess env -mca orte_ess_jobid 388497408 \
-mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 -hnp-uri \
388497408.0;tcp://<node_ip>:48823
Additionally, when I kill it, I get this message:
node2- daemon did not report back when launched
node3- daemon did not report back when launched
The cluster is set up with the mpi and boost libs accessible on an NFS-mounted drive. Am I running into a deadlock with NFS? Or is something else going on?
Update: To be clear, the boost program I am running is:
#include <boost/mpi/environment.hpp>
#include <boost/mpi/communicator.hpp>
#include <iostream>

namespace mpi = boost::mpi;

int main(int argc, char* argv[])
{
    mpi::environment env(argc, argv);
    mpi::communicator world;
    std::cout << "I am process " << world.rank() << " of " << world.size()
              << "." << std::endl;
    return 0;
}
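For reference, I built it with something along these lines (the boost library names are as they appear on my system and may differ elsewhere):

mpic++ boost1.cpp -o boost1 -lboost_mpi -lboost_serialization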
Following @Dirk Eddelbuettel's recommendations, I compiled and ran the mpi example hello_c.c, as follows:
#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, world, I am %d of %d\n", rank, size);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
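I built it with plain mpicc, i.e. something like:

mpicc hello_c.c -o hello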
It runs fine on a single machine with multiple processes; this includes sshing into any of the nodes and running it there. Each compute node is identical, with the working and mpi/boost directories mounted from a remote machine via NFS. When running the boost program from the fileserver (identical to a node, except that boost/mpi are local), I am able to run on two remote nodes. For "hello world", however, running the command mpirun -H node1,node2 -np 12 ./hello
I get:
[<node name>][[2771,1],<process #>] \
[btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] \
connect() to <node-ip> failed: No route to host (113)
while all of the "Hello world"s are printed and it hangs at the end.
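(A guess on my part: the "No route to host" could mean Open MPI is trying to connect over the wrong network interface. I believe the TCP transport can be restricted to a single interface with something like

mpirun --mca btl_tcp_if_include eth0 -H node1,node2 -np 12 ./hello

where eth0 is a placeholder for the cluster-facing interface, but I have not yet verified whether this changes anything.)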
Both "Hello world" and the boost code just hang with mpirun -H node1 -np 12 ./hello
when run from node2 and vice versa. (Hang in the same sense as above: orted is running on remote machine, but not communicating back.)
The fact that the behavior differs between running on the fileserver, where the mpi libs are local, and running on a compute node suggests that I may be running into an NFS deadlock. Is this a reasonable conclusion? Assuming that it is, how do I configure mpi so that I can link it statically? Additionally, I don't know what to make of the error I get when running from the fileserver; any thoughts?
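In case it helps frame the static-linking question: I assume it would mean rebuilding Open MPI with static libraries enabled, roughly

./configure --prefix=/opt/openmpi-static --enable-static --disable-shared
make && make install

(the prefix is just a placeholder), and then relinking the programs against that install, but I am not sure whether that is the right approach.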