views: 170
answers: 2
Can someone explain what Hadoop is in terms of the ideas behind the software? What makes it so popular and/or powerful?

+4  A: 

Hadoop implements Google's MapReduce model. To understand it better, you should read Google's MapReduce paper at http://labs.google.com/papers/mapreduce.html

cartman
+1  A: 

Hadoop is a programming environment that enables running massive computations in parallel on a large cluster of machines. It is resilient to the loss of several machines, scalable (computations can be sped up by adding machines) and trackable (it reports the status of a running computation). Hadoop is popular because it is a strong open-source environment and because many users, including large ones such as Yahoo!, Microsoft and Facebook, employ it for large data-crunching projects. It is powerful because it uses the map/reduce model, which decomposes a computation into a sequence of two simple operations:

  1. map - Take a list of items and perform the same simple operation on each of them. For example, take the text of a web page, tokenize it, and emit every token as a pair of the form word:1.
  2. reduce - Take a list of items and accumulate it using an accumulation operator. For example, take the list of word:1 pairs, count the occurrences of each word, and output a list of pairs of the form word:n, where n is the number of times that word appeared in the original list.

Using proper decomposition (which the programmer does) together with task distribution and monitoring (which Hadoop does), you get a fast, scalable computation; in our example, a word-counting computation (sketched in code below). You can chain tens of maps and reduces to implement sophisticated algorithms. This is the very high-level view. Now go read about MapReduce and Hadoop in further detail.
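For concreteness, here is a minimal sketch of that word-count job against Hadoop's org.apache.hadoop.mapreduce Java API. The class names (WordCount, TokenizerMapper, IntSumReducer) and the two command-line path arguments are illustrative, following the standard WordCount example shipped with Hadoop, not anything specific to this answer:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // map: for every input line, emit a (word, 1) pair for each token
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);      // the word:1 pairs from the answer
          }
        }
      }

      // reduce: for every word, sum the 1s emitted by the mappers
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);      // the word:n pairs from the answer
        }
      }

      public static void main(String[] args) throws Exception {
        // args[0] = input directory, args[1] = output directory (both on HDFS)
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // pre-aggregate on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

You would package this into a jar and submit it with something like hadoop jar wordcount.jar WordCount /input /output. The framework splits the input across the cluster, runs the mapper on each split, shuffles the word:1 pairs so that all pairs for the same word reach the same reducer, and writes the word:n totals to the output directory.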

Yuval F
I do not see any big idea in being able to count words. Obviously, thousands of computers in a cloud are not required to solve simple problems like these. Can you please elaborate on the implementations of sophisticated algorithms, if you have info on any such algorithms being used by Google?
That is far from obvious. Counting words for the whole web is a formidable task - say you are indexing five billion web pages, each having just a hundred words. Thus you have to count five hundred billion words. I would not start this without a powerful cluster of computers. To get a feel for more complex applications of Hadoop, see Yahoo!'s webmap: http://developer.yahoo.net/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html and Apache Mahout: http://lucene.apache.org/mahout/
Yuval F
I was just trying to probe whether any of you pals know what is inside the box. I would like to share that I have developed some indexing algorithms; the best of them (in terms of memory utilisation) yielded a benchmark performance of 800 transactions/sec on 20-character keys with a database size of 1 million records, running on a dual-core Athlon 64 at 3 GHz. The amount of index memory required is no more than the size of the keys, if not less. This means 16 GB of index data can be searched in one second. I would appreciate your feedback on the performance.