ansaurus

Question

Problem with copying local data onto HDFS on a Hadoop cluster using Amazon EC2/ S3.

Answer 1

+1 A:

You probably want to use s3n:// urls, not s3:// urls. s3n:// means "A regular file, readable from the outside world, at this S3 url". s3:// refers to an HDFS file system mapped into an S3 bucket.

To avoid the URL escaping issue for the access key (and to make life much easier), put them into the /etc/hadoop/conf/core-site.xml file:

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>0123458712355</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>hi/momasgasfglskfghaslkfjg</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>0123458712355</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>hi/momasgasfglskfghaslkfjg</value>
</property>

There was at one point an outstanding issue with secret keys that had a slash -- the URL was decoded in some contexts but not in others. I don't know if it's been fixed, but I do know that with the keys in the .conf this goes away.

Other quickies:

You can most quickly debug your problem using the hadoop filesystem commands, which work just fine on s3n:// (and s3://) urls. Try hadoop fs -cp s3n://myhappybucket/ or hadoop fs -cp s3n://myhappybucket/happyfile.txt /tmp/dest1 and even hadoop fs -cp /tmp/some_hdfs_file s3n://myhappybucket/will_be_put_into_s3
The distcp command runs a mapper-only command to copy a tree from there to here. Use it if you want to copy a very large number of files to the HDFS. (For everyday use, hadoop fs -cp src dest works just fine).
You don't have to move the data to the HDFS if you don't want. You can pull all the source data straight from s3, do all further manipulations targeting either the HDFS or S3 as you see fit.
Hadoop can become confused if there is a file s3n://myhappybucket/foo/bar and a "directory" (many files with keys s3n://myhappybucket/foo/bar/something). Some old versions of the s3sync command would leave just such 38-byte turds in the S3 tree.
If you start seeing SocketTimeoutException's, apply the patch for HADOOP-6254. We were, and we did, and they went away.

mrflip 2010-06-13 18:30:47

Answer 2

A:

@mrflip - Thanks for the comments. I tried editing the core-site.xml inside $HADOOP_HOME/conf/ directory. And when i run

hadoop fs -cp s3n://deepak-sample-jar/Twister-Dijstra-0.8.jar source

it throws

cp: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

There is something else i would like to mention. I am using

HADOOP_VERSION = hadoop-0.19.0

in my $HADOOP_HOME/src/contrib/ec2/bin/hadoop-ec2-env.sh (on my workstation). And so, on the cluster, inside /usr/local/hadoop-0.19/conf, there is no core-site.xml. There is hadoop-site.xml instead and i added

fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey

in both the files on all the machines of the cluster (?). But still the problem persists.

I don't know if there is a problem with hadoop-0.19.0. I tried changing HADOOP_VERSION to 0.20.2 but it wouldn't let me login into the cluster. When i run

./hadoop-ec2 launch-cluster <cluster-name> <no.of.nodes>

It throws,

required parameter AMI missing.

Please point me out to a good resource/tutorial on running Map/reduce jobs on a EC2 cluster using Hadoop if you could.

Thanks, Deepak.

Deepak Konidena 2010-06-15 16:14:13

mrflip 2010-06-20 00:24:29

Answer 3

+1 A:

Try using Amazon Elastic MapReduce. It removes the need for configuring the hadoop nodes, and you can just access objects in your s3 account in the way you expect.

Ben Hardy 2010-06-15 22:05:55

@Ben Hardy - Any good beginner's resources?

Deepak Konidena 2010-06-16 02:39:56

@Deepak try this out, there is a lot of information here. https://aws.amazon.com/documentation/elasticmapreduce/

Ben Hardy 2010-06-17 00:25:46

ansaurus

tags:

views:

answers:

Problem with copying local data onto HDFS on a Hadoop cluster using Amazon EC2/ S3.

related questions