I'm looking for some general information about how other people are using Hadoop or other MapReduce-like technologies. Specifically, I am curious whether you are writing MR applications to process existing data sets (like web server log files), or writing applications that generate and process new data sets.

Edit: Follow-up Questions

(1) Do you ever execute a MR program against data generated by other MR programs?

(2) Do you ever need to modify existing data sets using MR?

(3) Do you ever share your data sets with other developers?

+1  A: 

I am analyzing existing data sets, in my case traces of programmer activity.

Kent Beck
+4  A: 

Check out the PoweredBy page on the Hadoop wiki for examples of who is using it and how, everything from Facebook to FOX News.

Ryan Cox
+1  A: 

I have used Hadoop as part of Nutch, and for building and analyzing web graphs and text.

(1) Many tasks cannot be done in one go, so the need to run MR on MR-generated data is essential.

(2) Yes. When crawling with Nutch, there are situations where you need to filter or normalize the crawldb or other data.

(3) So far mainly as dumps or results of some kind, not as "native" MR data.
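To illustrate point (2), here is a minimal sketch of a map-only filter/normalize pass in plain Python. It is not Nutch's crawldb format or API; the `(url, status)` record shape and the names `normalize_url`, `map_clean` and `clean_records` are assumptions made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Lowercase the host, drop the fragment, and default an empty path to '/'."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

def map_clean(record):
    """Map step: emit a normalized record, or nothing if it is filtered out."""
    url, status = record  # hypothetical record shape, not a real crawldb entry
    clean = normalize_url(url)
    if clean.startswith(("http://", "https://")):
        yield clean, status

def clean_records(records):
    """Map-only job: no reduce step, each record is handled independently."""
    return [out for record in records for out in map_clean(record)]
```

Because each record is handled on its own, this kind of rewrite needs no shuffle or reduce step, so the output is simply a new, cleaned copy of the data set.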

refrus
A: 

My two uses so far have been analysis of large behavioral data sets (gathered from the web, mobile handsets, etc.) and parallelizing approaches to large problems (e.g., using genetic algorithms to find local optima in an NP-complete problem space).

In the general case, MR flows are multi-stage, so I'm frequently running against data generated by an earlier MR stage.
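A rough sketch of what such a multi-stage flow looks like, using a tiny in-memory stand-in for a MapReduce stage rather than Hadoop's actual API. The helper `run_mapreduce`, the log-line format (URL as the first whitespace-separated field) and the two stages are all assumptions chosen for illustration.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """In-memory stand-in for one MapReduce stage (not Hadoop's API)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # "shuffle": group values by key
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]

# Stage 1: count hits per URL from raw log lines.
def map_hits(line):
    yield line.split()[0], 1

def reduce_hits(url, ones):
    yield url, sum(ones)

# Stage 2: reads stage 1's output and counts how many URLs share each hit count.
def map_hist(pair):
    url, count = pair
    yield count, 1

def reduce_hist(count, ones):
    yield count, sum(ones)

def hit_histogram(lines):
    stage1 = run_mapreduce(lines, map_hits, reduce_hits)
    return run_mapreduce(stage1, map_hist, reduce_hist)
```

The second stage never sees the raw log lines; its only input is the `(url, count)` pairs written by the first stage, which is the pattern described above.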

bradheintz