Hello, I would like to know how to retrieve data from aggregated logs. This is what I have:
- about 30GB of uncompressed log data loaded into HDFS daily (and this will soon grow to about 100GB)
This is my idea:
- each night this data is processed with Pig
- logs are read and split, and a custom UDF extracts the fields I need from each log entry: timestamp, url, user_id (let's say this is all I need)
- the extracted data is loaded into HBase (log data will be stored indefinitely)
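As a rough sketch, the per-line extraction the custom UDF would do might look like this (the log format and field layout here are just assumptions; the real pattern would match my actual logs):

```python
import re

# Hypothetical log format (an assumption, adapt to the real logs):
#   2010-08-01T12:34:56  /some/page  user_12345  ...other fields...
LOG_PATTERN = re.compile(r"^(?P<timestamp>\S+)\s+(?P<url>\S+)\s+(?P<user_id>\S+)")

def extract_fields(line):
    """Return (timestamp, url, user_id) from one log line, or None if malformed."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None  # a real UDF would count and skip bad records
    return m.group("timestamp"), m.group("url"), m.group("user_id")

print(extract_fields("2010-08-01T12:34:56 /index.html user_42 extra fields"))
# → ('2010-08-01T12:34:56', '/index.html', 'user_42')
```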

Then, if I want to know which users saw a particular page within a given time range, I can query HBase quickly, without scanning the whole log data on each query (and I want fast answers; minutes are acceptable). There will also be multiple queries running simultaneously.
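One way I imagine this working is a composite row key of page plus fixed-width timestamp, so a page-and-time-range query becomes a single contiguous scan. A minimal sketch (the key layout is my assumption; the in-memory list stands in for an HBase table scan):

```python
# Sketch of an HBase row-key design for "which users saw page X between t1 and t2".

def row_key(url, epoch_seconds):
    # Fixed-width timestamp keeps lexicographic order equal to chronological order.
    return f"{url}#{epoch_seconds:010d}"

# In HBase this would be a Scan over [startRow, stopRow); here we simulate the
# same half-open range scan over an in-memory sorted list of (key, user) pairs.
table = sorted([
    (row_key("/index.html", 1280600000), "user_1"),
    (row_key("/index.html", 1280600500), "user_2"),
    (row_key("/other.html", 1280600100), "user_3"),
])

start = row_key("/index.html", 1280600000)
stop = row_key("/index.html", 1280601000)
users = [u for k, u in table if start <= k < stop]
print(users)  # → ['user_1', 'user_2']
```

Only the rows for the requested page and window are touched, which is the whole point of avoiding a full log scan per query.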

What do you think about this workflow? Do you think that loading this information into HBase makes sense? What are the other options, and how do they compare to my solution? I appreciate all comments, questions, and answers. Thank you in advance.

A: 

With Hadoop you are always doing one of two things: processing or querying.

For what you are looking to do, I would suggest using Hive: http://hadoop.apache.org/hive/. You can take your data and create an M/R job to process it and push it into Hive tables however you like. From there you can even partition the data, which may help query speed by skipping data that is not required, as you say. Then you can query out your results as you like. Here is a very good online tutorial: http://www.cloudera.com/videos/hive_tutorial
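To illustrate the partition-pruning point: if the table is partitioned by day, a query restricted to a date range only reads the matching partitions instead of the whole table. A rough sketch of which partition directories such a query would touch (the `dt=YYYY-MM-DD` layout is just an illustrative assumption):

```python
from datetime import date, timedelta

def partitions_to_read(first, last):
    """Yield day-partition directory names for an inclusive date range."""
    d = first
    while d <= last:
        yield f"dt={d.isoformat()}"
        d += timedelta(days=1)

# A 3-day query reads 3 partitions, not the full history.
wanted = list(partitions_to_read(date(2010, 8, 1), date(2010, 8, 3)))
print(wanted)  # → ['dt=2010-08-01', 'dt=2010-08-02', 'dt=2010-08-03']
```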

There are lots of ways to solve this, but it sounds like HBase is a bit of overkill, unless you want to set up all the servers it requires as an exercise to learn it. HBase would be a good fit if you had thousands of people simultaneously trying to get at the information.

You might also want to look into Flume, which is Cloudera's new log-ingestion tool. It will move your files from wherever they are straight into HDFS: http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/

Joe Stein
I'm familiar with Hive and have used it for querying, but it's definitely too slow. Analyzing a month of logs (up to 3TB) takes about 2-3 hours on my current hardware, and I want results in a matter of minutes (10 minutes at most). I'm using Hive or Pig right now for ad-hoc queries (since I don't have anything else), but I'm looking for other solutions or ideas.
Wojtek