views: 102
answers: 1

Hi,

This is a conceptual question about Hadoop/HDFS. Let's say you have a file containing 1 billion lines, and for the sake of simplicity, each line is of the form <k,v>, where k is the offset of the line from the beginning of the file and v is the content of the line.

Now, when we say that we want to run N map tasks, does the framework split the input file into N splits and run each map task on one of them, or do we have to write a partitioning function that produces the N splits and run each map task on the split it generates?

All I want to know is whether the splits are done internally or whether we have to split the data manually.

More specifically, each time the map() function is called, what exactly are its Key key and Value val parameters?

Thanks, Deepak

+2  A: 

The InputFormat is responsible for providing the splits.
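To answer the last part of the question directly: with the default TextInputFormat, the key passed to map() is the byte offset of the line within the file and the value is the line itself. Below is a minimal mapper sketch using the org.apache.hadoop.mapreduce API; the class name LineMapper and the output types are only illustrative placeholders, not anything prescribed by Hadoop.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With the default TextInputFormat, each call to map() receives:
//   key   - byte offset of the line from the start of the file (LongWritable)
//   value - the content of that line (Text)
public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (line, offset) just to show how the two parameters are used.
        context.write(value, key);
    }
}
```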

In general, if you have n nodes, HDFS will distribute the file's blocks over those n nodes. When you start a job, Hadoop creates one mapper per input split by default (typically one split per HDFS block). Thanks to Hadoop's scheduling, a mapper will usually run on the machine that stores the data it is going to process. I think this is called data locality (rack awareness is the related mechanism that tells Hadoop which node sits in which rack).

So, to make a long story short: upload the data into HDFS and start an MR job. Hadoop will take care of the optimised execution.
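For completeness, a job driver along those lines could look like the sketch below. It assumes the LineMapper class from the earlier sketch; the job name, the map-only setup (zero reducers), and the command-line input/output paths are arbitrary illustrative choices.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "line job");
        job.setJarByClass(LineJobDriver.class);

        // TextInputFormat is the default; set explicitly here for clarity.
        // It computes the input splits - you never split the file yourself.
        job.setInputFormatClass(TextInputFormat.class);

        job.setMapperClass(LineMapper.class);   // sketch from above
        job.setNumReduceTasks(0);               // map-only, just for illustration
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would then run it with something like hadoop jar myjob.jar LineJobDriver /input /output, where the jar name and paths are again just placeholders.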

Peter Wippermann
@Pentius - Does the mapper on a machine also access data on other machines, or does it only process the data on its own machine?
Deepak Konidena
The default word count example on the Hadoop site doesn't use an InputFormat explicitly. What happens if I run n map tasks on that example? Does each map task access all the content in the file? Thanks again.
Deepak Konidena
First of all, thanks for the vote :-) --- The word count example uses TextInputFormat, which is a subclass of InputFormat. --- Since the number of splits matches the number of mappers, each mapper will most likely process the data that is nearest to it. It could of course access data on other machines, but this is avoided because of the network cost.
Peter Wippermann
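If you want to see the split-to-mapper relationship mentioned in the comments for yourself, you can ask the InputFormat for its splits directly. The sketch below assumes the newer org.apache.hadoop.mapreduce API; the input path comes from the command line and the class name SplitInspector is a placeholder. The size of the returned list is the number of map tasks Hadoop will launch for that input.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitInspector {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split inspector");
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // TextInputFormat extends FileInputFormat, which extends InputFormat.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);

        // One map task is started per split; typically one split per HDFS block.
        System.out.println("Number of splits (and map tasks): " + splits.size());
    }
}
```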