Looking at the combination of MapReduce and HBase from a data-flow perspective, my problem seems to fit. I have a large set of documents which I want to Map, Combine and Reduce. My previous SQL implementation was to split the task into batch operations, cumulatively storing what would be the result of the Map into a table and then performing the equivalent of a Reduce. This had the benefit that at any point during execution (or between executions), I had the results of the Map at that point in time.

As I understand it, running this job as a MapReduce would require all of the Map functions to run each time.

My Map functions (and indeed any function) always give the same output for a given input. There is simply no point in re-calculating output if I don't have to. My input (a set of documents) will be continually growing, and I will run my MapReduce operation periodically over the data. Between executions I should only really have to calculate the Map functions for newly added documents.

My data flow will probably be HBase -> MapReduce -> HBase. Given that Hadoop is a whole ecosystem, perhaps it can know that a given function has already been applied to a row with a given identity. I'm assuming immutable entries in the HBase table. Does / can Hadoop take account of this?

The documentation (especially the Cloudera videos) makes the point that, for the class of problem Hadoop is aimed at, re-calculating (potentially redundant) data can be quicker than persisting and retrieving it.

Any comments / answers?

+1  A: 

Why not apply your SQL workflow in a different environment? Meaning, add a "processed" column to your input table. When it's time to run a summary, run a pipeline that goes something like this:

map (map_function) on (input table filtered by !processed); store into map_outputs, either in HBase or simply HDFS.

map (reduce_function) on (map_outputs); store into HBase.

You can make life a little easier, assuming you are storing your data in HBase sorted by insertion date, if you record the timestamps of successful summary runs somewhere and open the filter only on inputs dated later than the last successful summary -- you'll save some significant scanning time.
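
Something like the following, purely as a sketch of that incremental first pass (using the timestamp variant rather than a "processed" column). The "documents" table name, the DocumentMapper, and how you obtain lastRun are placeholders you'd swap for your own; it assumes the org.apache.hadoop.hbase.mapreduce API and stores the map outputs in a per-run HDFS directory:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IncrementalMapPass {

        // Placeholder map_function: emits (row key, number of cells in the row).
        public static class DocumentMapper extends TableMapper<Text, Text> {
            @Override
            protected void map(ImmutableBytesWritable row, Result value, Context context)
                    throws IOException, InterruptedException {
                context.write(new Text(Bytes.toString(row.get())),
                              new Text(String.valueOf(value.size())));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "incremental-map-pass");
            job.setJarByClass(IncrementalMapPass.class);

            // Timestamp of the last successful summary, read from wherever you record it.
            long lastRun = Long.parseLong(args[0]);

            // Only scan cells written after the last successful run, so documents
            // that were already mapped are never re-mapped.
            Scan scan = new Scan();
            scan.setCaching(500);
            scan.setCacheBlocks(false);
            scan.setTimeRange(lastRun, Long.MAX_VALUE);

            TableMapReduceUtil.initTableMapperJob(
                    "documents", scan, DocumentMapper.class,
                    Text.class, Text.class, job);

            // Map-only pass: store map_outputs (here on HDFS) for the second pass to consume.
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileOutputFormat.setOutputPath(job, new Path("map_outputs/" + lastRun));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }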

Here's an interesting presentation that shows how one company architected their workflow (although they do not use Hbase): http://www.scribd.com/doc/20971412/Hadoop-World-Production-Deep-Dive-with-High-Availability

SquareCog
I'm not entirely sure what kind of schemas you can have with HBase, but I thought it was always Key => Value where the Key is unique. Would the intermediary outputs (from the Map functions) fit in this structure, given that I may have many outputs with the same key? Also, I think your suggestion would mean merging the output from one run with that of a previous one. Is this possible? I assumed the output from an MR execution emptied the destination (FS directory or BigTable table) first?
Joe
HBase supports rows and columns; the difference from an RDBMS (briefly) is that transactions are not available across rows (though you do get ACID guarantees on updates to different columns within the same row), and that columns are sparse -- you can have many columns, and different rows can have different columns. In regular MR, appends are impossible (an HDFS issue), but with HBase you can simply insert more rows into a table, so I think it should work.
SquareCog
+1  A: 

If you're looking to avoid running the Map step each time, break it out as its own step (either by using the IdentityReducer or setting the number of reducers for the job to 0) and run later steps using the output of your map step.
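
A bare-bones map-only driver might look something like this sketch (DocMapper is just a stand-in for your real map function, and the input/output paths come from the command line):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyStep {

        // Stand-in map function: expects "docId<TAB>document text" lines and
        // emits (docId, document length).
        public static class DocMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] parts = line.toString().split("\t", 2);
                if (parts.length == 2) {
                    context.write(new Text(parts[0]),
                                  new Text(String.valueOf(parts[1].length())));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "map-only-step");
            job.setJarByClass(MapOnlyStep.class);
            job.setMapperClass(DocMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            // Zero reducers: no shuffle or sort, the mapper output is written
            // straight out by the output format. (Alternatively, keep a
            // pass-through reduce phase by using the IdentityReducer.)
            job.setNumReduceTasks(0);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }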

Whether this is actually faster than recomputing from the raw data each time depends on the volume and shape of the input data vs. the output data, how complicated your map step is, etc.

Note that running your mapper on new data sets won't append to previous runs -- but you can get around this by using a dated output folder. That is, you could store the output of mapping your first batch of files in my_mapper_output/20091101, the next week's batch in my_mapper_output/20091108, and so on. If you want to reduce over the whole set, you should be able to pass in my_mapper_output (or a glob such as my_mapper_output/*) as the input and catch all of the output sets.
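
A sketch of that reduce-over-everything job, assuming the dated layout above and that the map step wrote plain tab-separated key/value text (MergeReducer is just a placeholder for your real reduce function):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ReduceOverAllBatches {

        // Placeholder reduce function: concatenates the per-batch values for each key.
        public static class MergeReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                StringBuilder merged = new StringBuilder();
                for (Text v : values) {
                    if (merged.length() > 0) merged.append(",");
                    merged.append(v.toString());
                }
                context.write(key, new Text(merged.toString()));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "reduce-over-all-batches");
            job.setJarByClass(ReduceOverAllBatches.class);

            // Read the mapper output back as (key, value) Text pairs.
            job.setInputFormatClass(KeyValueTextInputFormat.class);
            // The base Mapper class just passes keys and values through unchanged.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(MergeReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            // The glob picks up every dated batch: my_mapper_output/20091101, 20091108, ...
            FileInputFormat.addInputPath(job, new Path("my_mapper_output/*"));
            FileOutputFormat.setOutputPath(job, new Path("summary_output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }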

bradheintz