I may be wrong, but all(?) the examples I've seen with Apache Hadoop take as input a file stored on the local file system (e.g. org.apache.hadoop.examples.Grep).

Is there a way to load and save the data on the Hadoop file system (HDFS)? For example, I put a tab-delimited file named 'stored.xls' on HDFS using hadoop-0.19.1/bin/hadoop dfs -put ~/local.xls stored.xls. How should I configure the JobConf to read it?

Thanks.

A: 
JobConf conf = new JobConf(getConf(), ...);
...
FileInputFormat.setInputPaths(conf, new Path("stored.xls"));
...
JobClient.runJob(conf);
...

setInputPaths will do it.

yogman
Thanks, but it throws an exception saying that "file:/home/me/workspace/HADOOP/stored.xls" (this is a local path) doesn't exist. The file on HDFS is in '/user/me/stored.xls'. I also tried new Path("/user/me/stored.xls") and it doesn't work either.
Pierre
First off, it's strange that Hadoop complained about "file:" rather than "hdfs:". It might be that your hadoop-site.xml is misconfigured. Second, if that still doesn't work, mkdir input and put stored.xls in the "input" dir (all with the bin/hadoop fs command), and use new Path("input") instead of new Path("stored.xls").
yogman
Revealing your command line to run the job wouldn't hurt.
yogman
A: 

Pierre, the default configuration for Hadoop is to run in local mode, rather than in distributed mode. You likely just need to modify some configuration in your hadoop-site.xml. It looks like your default filesystem is still the local one, when it should be hdfs://youraddress:yourport. Look at your setting for fs.default.name, and also see the setup help at Michael Noll's blog for more details.
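A minimal sketch of the relevant hadoop-site.xml property (the hostname and port are placeholders; substitute the address of your NameNode):

```xml
<configuration>
  <!-- Default filesystem URI; when unset, Hadoop falls back to the local
       filesystem, which produces "file:" paths like the one in the error. -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```

With this set, a relative path like new Path("stored.xls") resolves against your HDFS home directory (/user/me/stored.xls) instead of the local working directory.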

Kevin Weil
A: 

FileInputFormat.setInputPaths(conf, new Path("hdfs://hostname:port/user/me/stored.xls"));

This will do it.

Harsha Hulageri