tags:
views: 5112
answers: 11
+19  Q: 

Hadoop examples?

I'm evaluating Hadoop as a possible tool for doing some log analysis. I want to analyze several kinds of statistics in one run. Each line of my log files has all sorts of potentially useful data that I'd like to aggregate. I'd like to get all of it out of the logs in a single Hadoop run, but every example Hadoop program I can find online seems to total exactly one thing — almost all of them just do word counts. Can I use Hadoop to solve two or more problems at once?

Are there other Hadoop examples, or a Hadoop tutorial out there, that don't solve the word count problem?

+2  A: 

Have you looked at the wiki? You could try looking through the software in the contrib section, though the code there will probably be hard to learn from. Looking over the page, they also seem to have a link to an external tutorial.

fuzzy-waffle
+3  A: 

Here are two examples using Cascading (an API over Hadoop):

A simple log parser: http://bit.ly/47DaJ6

One that calculates the arrival rate of requests: http://bit.ly/9M04F

You can start with the second and just keep adding metrics.

Cascading project site: http://www.cascading.org/

ckw

cwensel
+1  A: 

Amazon has a new service based on Hadoop; it's a great way to get started, and they have some nice examples. http://aws.amazon.com/elasticmapreduce/

Mo Flanagan
+1  A: 

There are several examples using Ruby under Hadoop Streaming in the wukong library. (Disclaimer: I am an author of same.) Besides the now-standard wordcount example, there's PageRank and a couple of simple graph-manipulation scripts.

mrflip
+2  A: 

With the normal MapReduce paradigm, you typically solve one problem at a time. In the map step you usually perform some transformation or denormalization; in the reduce step you aggregate the map outputs.

If you want to answer multiple questions about your data, the best way to do it in Hadoop is to write multiple jobs, or a sequence of jobs that read the previous step's outputs.

There are several higher-level abstraction languages and APIs (Pig, Hive, Cascading) that simplify some of this work for you, allowing you to write more traditional procedural or SQL-style code that, under the covers, just creates a sequence of Hadoop jobs.
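That said, a single job can still answer several questions at once if the mapper tags each emitted key with the metric it belongs to. Here is a minimal Hadoop Streaming-style sketch in Python; the log format ("ip status bytes") and the metric names are assumptions for illustration, not taken from the question above.

```python
# Hypothetical one-pass multi-metric mapper/reducer in the Hadoop
# Streaming style. Each input line feeds several metrics; the reducer
# only ever sums, because the metric name is encoded in the key.

def mapper(lines):
    """Assumed log format per line: '<ip> <status> <bytes>'."""
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed lines
        ip, status, nbytes = parts
        # One line, three metrics:
        yield ("status:%s" % status, 1)     # count per HTTP status
        yield ("bytes:total", int(nbytes))  # total bytes served
        yield ("ip:%s" % ip, 1)             # hits per client IP

def reducer(pairs):
    """Streaming delivers mapper output sorted by key, so identical
    keys arrive adjacent; sum the values for each run of keys."""
    current, total = None, 0
    for key, value in pairs:
        if key != current:
            if current is not None:
                yield current, total
            current, total = key, 0
        total += value
    if current is not None:
        yield current, total
```

In a real streaming job, `mapper` would read `sys.stdin` and print tab-separated key/value lines, and Hadoop's shuffle would do the sorting between the two scripts; the tagged-key trick is what lets one job carry many aggregations.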

Ilya Haykinson
+17  A: 

One of the best resources I have found for getting started is Cloudera. They are a startup composed mainly of ex-Google and ex-Yahoo engineers. Their site has a training section with lessons on the different technologies, which I found very useful for playing with straight Hadoop, Pig, and Hive. They also have a virtual machine you can download with everything configured, plus some examples to help you get coding. All of that is free in the training section. The only thing I couldn't find is an HBase tutorial; I've been looking for one for a while. Best of luck.

Ryan H
+1  A: 

There was a course taught by Jimmy Lin at the University of Maryland. He developed the Cloud9 package as a training tool; it contains several examples.

Cloud9 Documentation and Source

+3  A: 

I'm finishing up a tutorial on processing Wikipedia pageview log files, several parts of which compute multiple metrics in one pass (sum of pageviews, trend over the last 24 hours, running regressions, etc.). The code is here: http://github.com/datawrangling/trendingtopics/tree/master

The Hadoop code mostly uses a mix of Python streaming and Hive, with the Cloudera distro on EC2.
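As a rough sketch of what one-pass aggregation over pageview logs can look like (illustrative Python only, not code from the tutorial; the record format and the crude "trend" measure are assumptions):

```python
# Hypothetical reduce-side pass over hourly pageview counts, emitting
# two metrics per page in the same run: the total and a crude trend.
# Input is assumed pre-sorted by page, as (page, hour, count) tuples.

def summarize(page, counts):
    """counts: dict of hour -> pageviews for one page."""
    hours = sorted(counts)
    total = sum(counts.values())
    # Crude trend: change from the earliest to the latest hour seen.
    trend = counts[hours[-1]] - counts[hours[0]]
    return page, total, trend

def aggregate(records):
    """records: iterable of (page, hour, count), sorted by page.
    Yields (page, total_views, trend) — several metrics, one pass."""
    current, counts = None, {}
    for page, hour, count in records:
        if page != current:
            if current is not None:
                yield summarize(current, counts)
            current, counts = page, {}
        counts[hour] = counts.get(hour, 0) + count
    if current is not None:
        yield summarize(current, counts)
```

The real tutorial computes richer statistics (running regressions, 24-hour trends), but the shape is the same: group once, then derive as many summaries as you like from each group.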

Pete Skomoroch
I loved your tutorial, Pete, especially the overview you gave at Hadoop World. Awesome stuff!
matpalm
+1  A: 

You can also follow the Cloudera blog; they recently posted a really good article about Apache log analysis with Pig.

Ro
As the author of said article, I want to point out that it was written more from a "getting familiar with Pig" perspective than a "doing log parsing in Hadoop" perspective. There are more efficient and less verbose ways to do those things. But yeah, Pig is nice for this sort of stuff at large scale.
SquareCog
A: 

I'm sure you've solved your problem by now, but for those who still get redirected here from a Google search for examples, here is an excellent blog with hundreds of lines of working code: http://sujitpal.blogspot.com/

alex
+1  A: 

You can refer to Tom White's Hadoop book for more examples and use cases: http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449389732/

Pavan Yara