I've been tasked with processing multiple terabytes' worth of SCM data for my company. I've set up a Hadoop cluster and have a script to pull data from our SCM servers.

Since I'm processing the data in batches through the streaming interface, I've run into an issue with block sizes that O'Reilly's Hadoop book doesn't seem to address: what happens to data straddling two blocks? How does the wordcount example get around this? To work around the issue so far, we've resorted to making each input file smaller than 64 MB.

The issue came up again when thinking about the reducer script: how is the aggregated data from the maps stored, and could the same problem come up when reducing?

A: 

If you have multiple terabytes of input, you should consider setting the block size to even more than 128 MB.

If a file is bigger than one block, it can either be split, so that each block of the file goes to a different mapper, or the whole file can go to a single mapper (for example, if the file is gzipped and therefore not splittable). I believe you can control this with configuration options.

Splits are taken care of automatically and you should not have to worry about them. Intermediate output from the maps is stored in temporary files on the local disks of the map nodes until the reducers fetch it.
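
For example, here is a minimal word-count-style streaming mapper sketch (Python is just an assumption; any language that reads stdin works). The point is that the framework's record reader hands the script complete lines even when a line physically straddles two blocks: the reader for one split reads past the split boundary to finish its last line, and the reader for the next split skips that partial first line.

    #!/usr/bin/env python
    # mapper.py -- minimal word-count mapper for Hadoop Streaming (illustrative).
    #
    # Streaming delivers one complete input line per line on stdin, so this
    # script never sees a record that has been cut at a block boundary.
    import sys

    def main():
        for line in sys.stdin:
            for word in line.split():
                # key<TAB>value is the default streaming output convention
                sys.stdout.write(word + "\t1\n")

    if __name__ == "__main__":
        main()

So correctness comes from the record reader, not from keeping each input file under one block.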

Wojtek
+1  A: 

This should not be an issue, provided each split can break the data apart cleanly (for example, at a line break). If your data is not a line-by-line data set, then yes, this could be a problem. You can also increase the block size on your cluster (dfs.block.size).
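
One thing to keep in mind: the block size of an HDFS file is fixed when the file is written, so to actually get bigger blocks you generally re-upload (or re-copy) the data with dfs.block.size overridden. Here is a hedged sketch of doing that from Python (the paths are placeholders, and I'm assuming your hadoop fs shell accepts the usual -D generic options):

    import subprocess

    # Placeholder paths -- replace with your own.
    LOCAL_FILE = "scm-dump.txt"
    HDFS_DIR = "/user/me/scm-input/"

    # Write the file into HDFS with 128 MB blocks instead of the 64 MB default.
    # dfs.block.size is given in bytes and only affects files written after the
    # override; files already in HDFS keep the block size they were written with.
    subprocess.check_call([
        "hadoop", "fs",
        "-D", "dfs.block.size=134217728",
        "-put", LOCAL_FILE, HDFS_DIR,
    ])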

You can also customize in your streaming job how lines are split into key/value pairs (for example, the lines your mapper writes to stdout):

http://hadoop.apache.org/common/docs/current/streaming.html#Customizing+the+Way+to+Split+Lines+into+Key%2FValue+Pairs

Data from the map step is then partitioned and sorted by its key, using a partitioner class:

http://hadoop.apache.org/common/docs/r0.15.2/streaming.html#A+Useful+Partitioner+Class+%28secondary+sort%2C+the+-partitioner+org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner+option%29
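
Putting those two pieces together, here is a hedged sketch of a streaming job submission that sets the separator options and the KeyFieldBasedPartitioner. The jar path, HDFS paths and script names are placeholders; the option names follow the linked (older) docs, very old releases spell the -D options as -jobconf, and newer ones may name some of them differently:

    import subprocess

    # Placeholder location -- adjust for your installation.
    STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"

    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        # Mapper output lines are '.'-separated; the first 2 fields form the key.
        "-D", "stream.map.output.field.separator=.",
        "-D", "stream.num.map.output.key.fields=2",
        # Partition on the first key field only, so rows that share it go to the
        # same reducer, while the full two-field key is still used for sorting.
        "-D", "map.output.key.field.separator=.",
        "-D", "num.key.fields.for.partition=1",
        "-input", "/user/me/scm-input",
        "-output", "/user/me/scm-output",
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-file", "mapper.py",
        "-file", "reducer.py",
        "-partitioner", "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner",
    ])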

The data is then shuffled so that all values for a given map key end up together and are transferred to the reducer. Optionally, a combiner can run before the reduce step.
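
To make that concrete, here is a minimal sum-style streaming reducer sketch (again Python, purely as an illustration). By the time it runs, the shuffle has already grouped and sorted the key<TAB>value lines by key on its stdin, so it only has to watch for the key changing. Because summing is associative and commutative, the same script could in principle serve as a combiner too (I believe newer streaming releases accept a script for -combiner, while older ones only take a Java class).

    #!/usr/bin/env python
    # reducer.py -- minimal sum reducer for Hadoop Streaming (illustrative).
    #
    # The shuffle delivers key<TAB>value lines sorted by key, so every line for
    # a given key arrives consecutively; we sum until the key changes.
    import sys

    def emit(key, total):
        if key is not None:
            sys.stdout.write("%s\t%d\n" % (key, total))

    def main():
        current_key, total = None, 0
        for line in sys.stdin:
            if not line.strip():
                continue
            key, _, value = line.rstrip("\n").partition("\t")
            if key != current_key:
                emit(current_key, total)
                current_key, total = key, 0
            total += int(value)
        emit(current_key, total)  # flush the final key

    if __name__ == "__main__":
        main()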

You can most likely also plug in a custom -inputreader (here is an example of one for streaming XML documents: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/streaming/StreamXmlRecordReader.html).
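
As a hedged sketch of that last point (the jar path, tag names, HDFS paths and the parse_revision.py mapper are all made up for illustration, and the exact -inputreader spec can vary between releases, so check the linked page for yours):

    import subprocess

    # Placeholder location -- adjust for your installation.
    STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"

    # Treat everything between <revision> and </revision> as a single record, so
    # an XML record that spans a block boundary still reaches the mapper whole.
    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-inputreader", "StreamXmlRecordReader,begin=<revision>,end=</revision>",
        "-input", "/user/me/scm-xml",
        "-output", "/user/me/scm-xml-out",
        "-mapper", "parse_revision.py",
        "-reducer", "reducer.py",
        "-file", "parse_revision.py",
        "-file", "reducer.py",
    ])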

Joe Stein
This is perfect. Most of my problems were solved by restricting the file split size to 64 MB (the size of a block), thus mapping each file (which fits into one block) to a single map process. On a 2-node cluster we managed to process about 2 GB of data in 3 minutes - ridiculously fast :)
bhargav