Hi,

We are launching a Hadoop cluster on Amazon EC2, and recently we have been having network issues, such as the master being unable to connect to a slave. We suspected that Amazon throttles network connections above some limit, so we tried establishing the connection from each slave node after a random delay. That didn't help.

Are there any other suggestions?

Thank you,
Bala
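For reference, the random-delay approach described in the question can be sketched as a retry loop with a randomized, growing delay. This is only an illustration and not the code actually used on the cluster: the function names, the default values for `base`, `cap`, and `attempts`, and the choice of a doubling upper bound are all assumptions.

```python
import random
import socket
import time


def backoff_delays(attempts, base=1.0, cap=30.0, rng=random):
    """Yield one randomized delay (in seconds) per attempt."""
    for attempt in range(attempts):
        # The upper bound doubles with each attempt but never exceeds `cap`.
        upper = min(cap, base * (2 ** attempt))
        yield rng.uniform(0, upper)


def connect_with_backoff(host, port, attempts=5, timeout=5.0,
                         base=1.0, cap=30.0, sleep=time.sleep):
    """Open a TCP connection, waiting a random delay between failed attempts.

    Returns the connected socket, or re-raises the last OSError once all
    attempts have failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt, delay in enumerate(backoff_delays(attempts, base, cap)):
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as exc:
            last_error = exc
            # Only wait if another attempt is still to come.
            if attempt < attempts - 1:
                sleep(delay)
    raise last_error
```

Randomizing the delay spreads out slaves that would otherwise all retry at the same moment. It does not help if the connection is being blocked or dropped for some other reason, which matches the result reported above.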

A: 

Have you tried using the hadoop-ec2 scripts from Cloudera? I've been using them to set up occasional Hadoop clusters for my thesis research, and I've found them to work quite well. The setup takes a few minutes, but once it's set up you just run

hadoop-ec2 launch-cluster <clustername> <number of slaves>

and it sets up everything you need, and usually does a really good job. Occasionally a node won't start up, but it's easy enough to terminate the cluster and try again, and it doesn't cost too much.

You can find the instructions for setting them up here:

http://archive.cloudera.com/docs/ec2.html
Paul Huff
Actually, we are using the Cloudera scripts to launch the cluster. The problem we are facing is how to identify whether there is a network issue: we run our processing during the night, and we are unable to tell whether a failure is a network failure. Do you have any suggestions on this?
Algorist
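One way to tell a network failure apart from other failures during an unattended night run is to probe each node's ports periodically and keep a timestamped log. The sketch below is a minimal illustration, not part of the Cloudera scripts. The function names are made up for this example, and the host names and ports in the usage note are placeholders to be replaced with the values from your own cluster configuration.

```python
import logging
import socket

logging.basicConfig(
    format="%(asctime)s %(levelname)s %(message)s", level=logging.INFO
)


def check_port(host, port, timeout=3.0):
    """Try a TCP connection and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.gaierror:
        return "dns_failure"   # host name could not be resolved
    except socket.timeout:
        return "timeout"       # no reply: packets dropped or filtered
    except ConnectionRefusedError:
        return "refused"       # host reachable, nothing listening on the port
    except OSError as exc:
        return "error: %s" % exc


def probe_cluster(hosts, ports, timeout=3.0):
    """Check every (host, port) pair, log each result, and return them all."""
    results = {}
    for host in hosts:
        for port in ports:
            status = check_port(host, port, timeout)
            results[(host, port)] = status
            level = logging.INFO if status == "ok" else logging.WARNING
            logging.log(level, "%s:%d %s", host, port, status)
    return results
```

For example, calling `probe_cluster(["slave1.example.internal"], [50010, 50060])` from cron on the master every few minutes would leave a record to compare against the time a job failed. The host name is hypothetical, and the two ports are assumed DataNode and TaskTracker defaults that may differ in your setup. A "timeout" points toward the network or a firewall rule, while "refused" suggests the node was reachable but the daemon was not running.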
A: 

Do you have the right ports open in the security group that your cluster instances use? I'm not familiar with Hadoop, but if it uses a custom TCP or UDP port for communication between nodes, then you'll need to specify it in your security group.

gareth_bowles
Actually, all the nodes in the cluster belong to the same security group, so they can communicate.
Algorist
A: 

Using Amazon Elastic MapReduce would alleviate many issues and provide some I/O boosts to S3 and between nodes, as well as a few AWS-specific patches to improve robustness.

It's probably wise to stay away from the EC2 cluster scripts unless you need a specific version of Hadoop, which you really shouldn't.

cwensel
I am still worried about EMR because it uses S3, and the basic idea of MapReduce is to move the computation to the data. S3 is a distributed data store, and I feel that it's not right for us. Do you have any reason to believe that S3 with EMR is better than EBS, which we are currently using?
Algorist
Don't forget that EMR also costs slightly more.
matpalm