
I am researching Hadoop to see which of its products suits our need for quick queries against large data sets (billions of records per set).

The queries will be performed against chip sequencing data. Each record is one line in a file. To be clear, a sample record from the data set is shown below.

One line (record) looks like this:

1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 ***103570835*** F .. 23G 24C

The highlighted field is called "position of match", and the query we are interested in is the number of sequences within a certain range of this "position of match". For instance, the range could be "position of match" > 200 and "position of match" + 36 < 200,000.

Any suggestions on the Hadoop product I should start with to accomplish the task? HBase, Pig, Hive, or ...?

+3  A: 

Rough guideline: If you need lots of queries that return fast and do not need to aggregate data, you want to use HBase. If you are looking at tasks that are more analysis and aggregation-focused, you want Pig or Hive.

HBase allows you to specify start and end rows for scans, meaning it should satisfy the query example you provide, and it seems most appropriate for your use case.
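For illustration, here is a minimal sketch of such a range scan with the HBase Java client (1.4+/2.x API). The table name, and the choice of a zero-padded "position of match" as the row key so that lexicographic order matches numeric order, are assumptions rather than anything from the thread:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PositionRangeScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             // "sequences" is a hypothetical table name
             Table table = conn.getTable(TableName.valueOf("sequences"))) {
            // Row keys are assumed to be zero-padded positions, e.g. "000103570835",
            // so that HBase's lexicographic row ordering matches numeric order.
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes(String.format("%012d", 201L)))      // position > 200
                    .withStopRow(Bytes.toBytes(String.format("%012d", 199964L)));   // position + 36 < 200,000
            long count = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result ignored : scanner) {
                    count++;
                }
            }
            System.out.println("Sequences in range: " + count);
        }
    }
}
```

Start rows are inclusive and stop rows exclusive by default, which is why the bounds above are 201 and 199,964 for the example range.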

SquareCog
+1  A: 

For posterity, here's the answer Xueling received on the Hadoop mailing list:

First, further detail from Xueling:

The datasets won't be updated often. But the query against a data set is frequent. The quicker the query, the better. For example, we have done testing on a MySQL database (5 billion records randomly scattered into 24 tables) and the slowest query against the biggest table (400,000,000 records) is around 12 minutes. So if any Hadoop product can speed up the search, then that product is what we are looking for.

The response, from Cloudera's Todd Lipcon:

In that case, I would recommend the following:

  1. Put all of your data on HDFS
  2. Write a MapReduce job that sorts the data by position of match
  3. As a second output of this job, you can write a "sparse index" - basically a set of entries like this:

     (position of match, file offset into the sorted data, number of entries following)

where you're basically giving offsets into every 10K records or so. If you index every 10K records, then 5 billion records in total will mean 500,000 index entries. Each index entry shouldn't be more than 20 bytes, so 500,000 entries will be about 10 MB. This is super easy to fit into memory. (You could probably index every 100th record instead and end up with around 1 GB, which still fits in memory.)
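To make that layout concrete, here is one possible shape for an in-memory index entry in Java; the field names and widths are illustrative assumptions, not a format prescribed in the thread:

```java
/**
 * One sparse-index entry, written every N records of the sorted output.
 * The fields mirror the entry format sketched above.
 */
public class SparseIndexEntry {
    final long positionOfMatch;  // first "position of match" in the block (8 bytes)
    final long fileOffset;       // byte offset of that record in the sorted HDFS file (8 bytes)
    final int entriesFollowing;  // number of records in the block, e.g. 10,000 (4 bytes)

    SparseIndexEntry(long positionOfMatch, long fileOffset, int entriesFollowing) {
        this.positionOfMatch = positionOfMatch;
        this.fileOffset = fileOffset;
        this.entriesFollowing = entriesFollowing;
    }
    // ~20 bytes of payload per entry; 5 billion records indexed every 10,000
    // records gives ~500,000 entries, roughly 10 MB in memory.
}
```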

Then to satisfy your count-range query, you can simply scan your in-memory sparse index. Some of the indexed blocks will be completely included in the range, in which case you just add up the "number of entries following" column. The start and finish block will be partially covered, so you can use the file offset info to load that file off HDFS, start reading at that offset, and finish the count.
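A rough sketch of that counting logic, building on the SparseIndexEntry sketch above; readPartialBlockCount is a hypothetical placeholder for the HDFS read of a partially covered block:

```java
import java.util.List;

public class RangeCounter {
    /**
     * Counts records whose "position of match" lies in [lo, hi] using the
     * in-memory sparse index (assumed sorted by position). Blocks fully inside
     * the range contribute their entriesFollowing count; partially covered
     * blocks at either end are resolved by reading the sorted file on HDFS.
     */
    static long countInRange(List<SparseIndexEntry> index, long lo, long hi) {
        long total = 0;
        for (int i = 0; i < index.size(); i++) {
            SparseIndexEntry block = index.get(i);
            long blockStart = block.positionOfMatch;
            long blockEnd = (i + 1 < index.size())
                    ? index.get(i + 1).positionOfMatch - 1
                    : Long.MAX_VALUE;
            if (blockEnd < lo || blockStart > hi) {
                continue; // block entirely outside the range
            }
            if (blockStart >= lo && blockEnd <= hi) {
                total += block.entriesFollowing; // block entirely inside the range
            } else {
                total += readPartialBlockCount(block, lo, hi); // partially covered block
            }
        }
        return total;
    }

    // Hypothetical helper: seek to block.fileOffset in the sorted HDFS file,
    // read block.entriesFollowing records, and count those within [lo, hi].
    static long readPartialBlockCount(SparseIndexEntry block, long lo, long hi) {
        throw new UnsupportedOperationException("HDFS read omitted in this sketch");
    }
}
```

Only the partially covered blocks at the two ends of the range ever touch HDFS, which is what keeps the per-query time low.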

Total time per query should be <100ms no problem.

A few subsequent replies suggested HBase.

Jeff Hammerbacher
A: 

You could also take a quick look at JAQL (http://code.google.com/p/jaql/), but unfortunately it's for querying JSON data. But maybe this helps anyway.

Peter Wippermann
So it's not for Hadoop, and it won't handle his data, and generally has nothing to do with the problem at hand, but it might be useful anyway?
skaffman
Well, did you read the abstract at the main page? "Jaql is a query language designed for Javascript Object Notation (JSON), a data format that has become popular because of its simplicity and modeling flexibility. Jaql is primarily used to analyze large-scale semi-structured data. Core features include user extensibility and parallelism. In addition to modeling semi-structured data, JSON simplifies extensibility. Hadoop's Map-Reduce is used for parallelism." So in fact it is for Hadoop! ;-)
Peter Wippermann