I've been tasked with processing several terabytes of SCM data for my company. I've set up a Hadoop cluster and have a script that pulls the data from our SCM servers.
Since I'm processing the data in batches through the streaming interface, I ran into a question about block sizes that O'Reilly's Hadoop book doesn't seem to address: what happens to a record that straddles two blocks? How does the wordcount example get around this? So far we've worked around it by keeping each input file under 64 MB.
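For context, our mapper is essentially a line-at-a-time streaming script in the wordcount style. The sketch below is a simplified stand-in (it just counts words rather than parsing our SCM records), but the block-boundary question applies to it the same way:

```python
#!/usr/bin/env python
# Simplified stand-in for our streaming mapper: reads lines from stdin
# and emits tab-separated key/value pairs, wordcount-style.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
```

My worry is what happens when one of those input lines physically spans two 64 MB blocks.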
The question came up again while I was thinking about the reducer script: how is the aggregated output from the map tasks stored, and does the same issue arise during the reduce phase?
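To make the reducer half concrete, this is roughly what I have in mind (again a simplified sketch; it assumes the framework delivers key-sorted lines on stdin, which is how I understand streaming reducers work):

```python
#!/usr/bin/env python
# Sketch of a streaming reducer: stdin arrives sorted by key, so we can
# sum consecutive counts for the same key and emit one total per key.
import sys

current_key = None
total = 0

for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, total))
        current_key = key
        total = 0
    total += int(value)

# Flush the final key after the input is exhausted.
if current_key is not None:
    print("%s\t%d" % (current_key, total))
```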