If I understand the Hadoop ecosystem correctly, I can run my MapReduce jobs against data stored in either HDFS or HBase. Assuming that is correct, why would I choose one over the other? Is there a performance, reliability, cost, or ease-of-use benefit to using HBase as a MapReduce source?

The best I've been able to find is this quote, "HBase is the Hadoop application to use when you require real-time read/write random-access to very large datasets." - Tom White (2009) Hadoop: The Definitive Guide, 1st Edition

+1  A: 

Using straight-up Hadoop Map/Reduce over HDFS, your inputs and outputs are typically stored as flat text files or Hadoop SequenceFiles, which are binary files of serialized key/value pairs. Files in HDFS are written once and then read rather than updated in place, so these data stores are more or less immutable. This makes plain Hadoop suitable for batch processing tasks.

HBase is a full-fledged database (albeit not relational) which uses HDFS as storage. This means you can run interactive queries and updates on your dataset.

What's nice about HBase is that it plays nicely with the Hadoop ecosystem, so if you have the need to perform batch processing as well as interactive, granular, record-level operations on huge datasets, HBase will do both well.
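
To make the difference in access patterns concrete, here is a toy sketch in plain Python. It does not use the Hadoop or HBase APIs at all; `FlatFileStore`, `SortedKeyStore`, and `count_by_value` are hypothetical names invented for illustration. The first class mimics an append-only file that can only be scanned, the second mimics a table sorted by row key that supports record-level reads and updates, and the same batch-style job runs over both.

```python
import bisect


class FlatFileStore:
    """Toy stand-in for a write-once file: append records, read by scanning."""

    def __init__(self):
        self._records = []

    def append(self, key, value):
        # Existing records are never modified; corrections are new records.
        self._records.append((key, value))

    def scan(self):
        return iter(self._records)

    def lookup(self, key):
        # No index: finding one key means reading every record.
        # The most recently appended record for the key wins.
        found = None
        for k, v in self._records:
            if k == key:
                found = v
        return found


class SortedKeyStore:
    """Toy stand-in for a table kept sorted by row key."""

    def __init__(self):
        self._keys = []   # row keys in sorted order
        self._rows = {}   # row key -> current value

    def put(self, key, value):
        # Record-level insert or update.
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        # Direct random read of a single row.
        return self._rows.get(key)

    def scan(self, start=None, stop=None):
        # Range scan over sorted keys: start inclusive, stop exclusive.
        lo = 0 if start is None else bisect.bisect_left(self._keys, start)
        hi = len(self._keys) if stop is None else bisect.bisect_left(self._keys, stop)
        for k in self._keys[lo:hi]:
            yield k, self._rows[k]


def count_by_value(records):
    """Batch-style job: count records per value from any (key, value) scan."""
    counts = {}
    for _, value in records:
        counts[value] = counts.get(value, 0) + 1
    return counts


flat = FlatFileStore()
table = SortedKeyStore()
for key, value in [("user1", "US"), ("user2", "DE"), ("user3", "US")]:
    flat.append(key, value)
    table.put(key, value)

# Correcting user2: the table updates the row in place,
# while the flat store can only append a newer record.
table.put("user2", "US")
flat.append("user2", "US")
```

Both stores can feed the same batch job through `scan()`, but only the table answers `get("user2")` without reading everything, and only the table holds a single current row for `user2` after the correction. A real deployment differs in many ways (distribution, versioning, on-disk layout), so treat this only as a picture of the trade-off described above.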

bajafresh4life
Thanks, that's what I was looking for.
Andre
