tags:
views: 708
answers: 4

+3  Q: 

What is Hadoop?

Hi,

I want to know what Hadoop is. I have gone through Google and Wikipedia, but I am still not clear on what Hadoop actually is and what its goal is.

Any useful information would be highly appreciated.

Note: Please do not provide a link to the wiki, as I have already read it; I am looking for a detailed explanation.

Thanks.

A: 

Please see Hadoop:

Apache Hadoop is a Java software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers.

Andrew Hare
@Andrew: I have gone through Wikipedia but would appreciate a more detailed explanation, as I did not understand much from the wiki :)
Rachel
What specifically do you not understand?
Andrew Hare
@Andrew: It enables applications to work with thousands of nodes and petabytes of data.
Rachel
@Andrew: What kind of work? What was the motivation for creating Hadoop? What real-world issues does Hadoop solve? What earlier technologies failed to satisfy the need that led to the development of Hadoop?
Rachel
+4  A: 

I can only guess that you're not really asking what Hadoop is but what MapReduce is, because that's essentially what Hadoop implements. I'd suggest looking at MapReduce: Simplified Data Processing on Large Clusters.

Search engines are an obvious application for this, but there are others. The sort of problem this is used to solve is one where:

  1. There are huge amounts of data;
  2. That data by necessity is distributed; and
  3. The problem can be highly parallelized.

Or, to put it another way, Hadoop is designed as a distributed work manager for huge amounts of data on a large number of systems. But Hadoop is more than just that: it also takes care of monitoring, failover and scheduling.

Here are some example applications.

Google often uses the example of creating indexes. The Map operation takes a set of Web pages and creates keyword indexes. The Reduce operation takes the distributed keyword maps and combines them.
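To make that index example concrete, here is a minimal sketch of what such a Map/Reduce pair could look like using Hadoop's org.apache.hadoop.mapreduce API. The assumed input format (one page per line, of the form "url<TAB>page text") and the class names are illustrative assumptions, not something from the answer above.

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class InvertedIndex {

      // Map: for every keyword on a page, emit a (keyword, url) pair.
      public static class IndexMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        private final Text word = new Text();
        private final Text url = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          // Assumed input: "url<TAB>page text" -- one page per line.
          String[] parts = line.toString().split("\t", 2);
          if (parts.length < 2) return;              // skip malformed lines
          url.set(parts[0]);
          StringTokenizer tokens = new StringTokenizer(parts[1]);
          while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken().toLowerCase());
            context.write(word, url);
          }
        }
      }

      // Reduce: all urls for one keyword arrive together; merge them into a posting list.
      public static class IndexReducer
          extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> urls, Context context)
            throws IOException, InterruptedException {
          Set<String> postings = new HashSet<>();
          for (Text u : urls) {
            postings.add(u.toString());
          }
          context.write(word, new Text(String.join(",", postings)));
        }
      }
    }

The framework takes care of bringing all the (keyword, url) pairs with the same keyword to the same reducer, which is exactly the "combines the distributed keyword maps" step described above.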

cletus
@Cletus: Thank you for the information. This was something I was looking for.
Rachel
A: 

Disclosure: I do not have any exposure to Hadoop. Some of the descriptions below may fall short of "technically accurate"; this is due both to my own ignorance and to my attempt at putting things in simple terms, in the spirit of Rachel's original question and the various "redirects" following Andrew Hare's responses.

Please feel free to leave remarks and annotations (or even to edit this response directly) if you can improve the accuracy of the text while keeping the material accessible to non-practitioners.

Hadoop lets you run distributed applications "in the cloud", i.e. on a dynamic collection of commodity hardware hosts.

In very broad terms, the idea is to run a task in parallel fashion without having to worry [much] about where the individual portions of the task take place, nor about where the data files associated with the task (whether read or written) reside.

Examples: what can and what cannot be "Hadooped"
Search engines are a typical example of a task that lends itself to being handled by Hadoop or similar systems, but other examples include big sorting jobs, document categorization/clustering, SVD and other linear algebra operations (obviously with "big" matrices), etc. A required characteristic of tasks to be handled with Hadoop, however, is that they can be described in terms of two functions: a Map() function, which is used to split the task into several structurally identical pieces, and a Reduce() function, which takes the individual "pieces" and processes them, producing mergeable results.

The idea behind the MapReduce model is that it provides a generic and clear description of anything that can be processed in parallel. This helps both the application developers, who just provide the Map/Reduce function pair but otherwise do not have to worry about synchronization details etc., and the Hadoop system itself, which can then handle every job in the very same fashion (even though the implementations of the Map/Reduce functions may vary tremendously). A toy sketch of this division of labour follows below.
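To illustrate that division of labour, here is a toy, single-machine sketch (plain Java, not Hadoop code): the generic driver knows nothing about the application, and the developer supplies only the two functions. All names here are hypothetical and chosen purely for illustration.

    import java.util.AbstractMap.SimpleEntry;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Map.Entry;
    import java.util.function.BiFunction;
    import java.util.function.Function;

    public class MiniMapReduce {

      // The "framework": it only needs the two functions. Splitting the input,
      // grouping by key (the "shuffle") and running the pieces are all generic.
      static <I, K, V, R> Map<K, R> run(List<I> inputs,
                                        Function<I, List<Entry<K, V>>> mapFn,
                                        BiFunction<K, List<V>, R> reduceFn) {
        // Shuffle: group every emitted (key, value) pair by key.
        Map<K, List<V>> grouped = new HashMap<>();
        for (I input : inputs) {                       // each call could run on a different node
          for (Entry<K, V> kv : mapFn.apply(input)) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
          }
        }
        // Reduce each group independently -- again trivially parallel.
        Map<K, R> result = new HashMap<>();
        grouped.forEach((k, vs) -> result.put(k, reduceFn.apply(k, vs)));
        return result;
      }

      public static void main(String[] args) {
        // The application developer supplies only the Map and Reduce logic:
        // here, a word count over a handful of tiny "documents".
        List<String> docs = List.of("big data", "big clusters", "data data");

        Function<String, List<Entry<String, Integer>>> mapFn = doc -> {
          List<Entry<String, Integer>> pairs = new ArrayList<>();
          for (String w : doc.split("\\s+")) pairs.add(new SimpleEntry<>(w, 1));
          return pairs;
        };
        BiFunction<String, List<Integer>, Integer> reduceFn = (word, ones) -> ones.size();

        System.out.println(run(docs, mapFn, reduceFn));   // e.g. {big=2, clusters=1, data=3}
      }
    }

Hadoop does conceptually the same thing, except that the map and reduce calls run on many machines and the shuffle moves data across the network.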

Another key component of Hadoop is the distributed file system, which allows huge amounts of data (petabyte-sized) to be stored reliably. Hadoop includes several additional subsystems, features and configuration parameters which collectively make it possible to build fault-tolerant, dynamic systems.
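For a feel of how an application sees that distributed file system, here is a small, hedged sketch using the org.apache.hadoop.fs.FileSystem API; the path is hypothetical, and the block splitting and replication happen behind these calls.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsHello {
      public static void main(String[] args) throws IOException {
        // Picks up the cluster settings (which file system, which name node)
        // from the Hadoop configuration files on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/rachel/hello.txt");   // hypothetical path

        // Writing: the file system splits the data into blocks and replicates
        // them across nodes; the application just sees a stream.
        try (FSDataOutputStream out = fs.create(path, true)) {   // true = overwrite
          out.writeUTF("hello, distributed file system");
        }

        // Reading: blocks may come from several machines, transparently.
        try (FSDataInputStream in = fs.open(path)) {
          System.out.println(in.readUTF());
        }
      }
    }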

Vocabulary:
A node is a computer host which can run Map() or Reduce() jobs (as well as adhere to the protocol of the Hadoop system, i.e. "obey" the Hadoop manager which orchestrates all these moving parts). A Hadoop system can have several thousand nodes, i.e. potential workers to which it can feed jobs.
Petabyte = 1 million gigabytes (10^15 bytes).
Cloud computing: computing based on dynamically scalable groups of general-purpose computers (physical or virtual) available over a network (or, more generally, over the Internet). Two key concepts are: a) applications using the cloud have no knowledge of, nor direct control over, the set of resources which services their requests at a given time; b) no software specific to a given application gets installed "ahead of time" on the nodes (well... they do receive, and possibly cache, the logic of the Map() or Reduce() functions, and they do benefit from the current state of the file system with the various datasets belonging to the application, but the idea is that they are "commodity": they can come and go, and yet the work will get done).

mjv
+4  A: 

The definition of Map and Reduce is a bit hard for some to grasp, so here is a simple example to get you into the right frame of mind. Let's begin with a standard Google interview question:

Assume you have a 10TB file of search queries and 10 compute nodes. A search query is a line of about 50 characters containing the word or phrase someone has entered for search.

Your task is to find the top 1000 search phrases.

Limitations:

  1. You only get to scan the 10TB file ONCE
  2. You have no stats or models about the search phrases
  3. The compute nodes must all be used roughly equally.

Solution:

Using a hash function (in this area also known as a global hash function), distribute the work amongst the nodes with the following process:

  1. Give each node a unique ID between 0-9
  2. Upon every line scan of the 10TB file compute the hash of the line
  3. Quantize the hash value between 0-9 in a uniform manner
  4. Send that line to the compute node with the same ID

This process tries to ensure (based on the Poisson distribution) that each node will get roughly the same amount of work; a minimal sketch of the hashing step follows below. Any other approach relies on models or stats about the search data to evenly distribute the load of work (which is not feasible when you don't know what the data is).
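Here is a small sketch of that hashing/quantizing step in Java. The hash choice (MD5) and the class name are assumptions for illustration; any hash that spreads the lines uniformly would do.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class QueryPartitioner {
      private static final int NUM_NODES = 10;   // node IDs 0-9, as in the example above

      // Map a query line to a node ID in 0..9 by hashing the line and quantizing.
      static int nodeFor(String queryLine) throws NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(queryLine.getBytes(StandardCharsets.UTF_8));
        // Take the first 4 bytes as an int, then reduce it to 0..9 uniformly.
        int h = ((digest[0] & 0xFF) << 24) | ((digest[1] & 0xFF) << 16)
              | ((digest[2] & 0xFF) << 8)  |  (digest[3] & 0xFF);
        return Math.floorMod(h, NUM_NODES);
      }

      public static void main(String[] args) throws NoSuchAlgorithmException {
        String[] sample = { "hadoop tutorial", "cheap flights", "hadoop tutorial" };
        for (String q : sample) {
          // Identical phrases always land on the same node, so each node can
          // count its own phrases locally before a final top-1000 merge.
          System.out.println(q + " -> node " + nodeFor(q));
        }
      }
    }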

Hadoop uses a DHT and some other nifty things to provide, among other things, this kind of workload balancing as well as distributed storage access.

Beh Tou Cheh
That's not how Hadoop distributes processing. What you describe would mean that a central processor reads the whole 10TB and sends it over the wire to worker nodes. If you look up disk read speeds, you will see that this is not feasible. Instead, and I am simplifying here, Hadoop assigns map tasks to nodes based on file offsets (node 1 gets 0-1TB, node 2 gets 1TB-2TB, etc.). It uses a distributed file system, HDFS, to store your 10TB so that it's broken up into chunks, and it tries to place the computation of tasks on the same nodes as those that store the chunks.
SquareCog