tags:
views: 50
answers: 1

What is the easiest to use distributed map reduce programming system?

For example, in a distributed datastore containing many users, each with many connections, suppose I wanted to count the total number of connections:

Map:
for all records of type "user"
do for each user
    count number of connections
    return connection_count_for_one_user

Reduce:
reduce (connection_count_for_one_user)
    total_connections += connection_count_for_one_user
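In plain, single-machine Python terms (just a local sketch of the idea, not a real distributed job, with a made-up record format), that map/reduce looks like:

```python
from functools import reduce

# Hypothetical sample datastore: each "user" record lists its connections.
users = [
    {"type": "user", "connections": ["a", "b"]},
    {"type": "user", "connections": ["c"]},
    {"type": "user", "connections": ["a", "c", "d"]},
]

# Map: emit one connection count per record of type "user".
counts = [len(u["connections"]) for u in users if u["type"] == "user"]

# Reduce: fold the per-user counts into a grand total.
total_connections = reduce(lambda acc, c: acc + c, counts, 0)

print(total_connections)  # 2 + 1 + 3 = 6
```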

Is there any mapreduce system that lets me program in this way?

+1  A: 

Well, I'll take a stab at making some suggestions, but your question isn't entirely clear.

So how are you storing your data? The storage mechanism is separate from how you apply MapReduce algorithms to the data. I'm going to assume you are using the Hadoop Distributed File System (HDFS).

The problem you illustrate actually looks very similar to the typical Hadoop MapReduce word-count example; instead of counting words, you are counting connections per user.

Some of the options you have for applying MapReduce to data stored on HDFS are:

  • Java framework - the native Hadoop MapReduce API; good if you are comfortable with Java.
  • Pig - a high-level scripting language (Pig Latin) that compiles down to MapReduce jobs.
  • Hive - a data warehousing solution for Hadoop that provides a SQL-like interface (HiveQL).
  • Hadoop streaming - allows you to write mappers and reducers in pretty much any language that can read and write standard input/output.
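To give a flavour of the streaming option, here is a rough mapper/reducer pair in Python. It assumes a hypothetical input format of one user record per line, `user_id<TAB>comma-separated-connections`; this is a sketch of the streaming pattern, not a drop-in job:

```python
import sys

def mapper(lines):
    # Each input line is a hypothetical "user_id<TAB>conn1,conn2,..." record.
    # Emit everything under a single key so all counts meet at one reducer.
    for line in lines:
        user_id, _, conns = line.rstrip("\n").partition("\t")
        count = len(conns.split(",")) if conns else 0
        yield f"total\t{count}"

def reducer(pairs):
    # Hadoop streaming delivers mapper output sorted by key; here there is
    # only one key ("total"), so the reducer simply sums the values.
    total = 0
    for pair in pairs:
        _, _, value = pair.partition("\t")
        total += int(value)
    yield f"total\t{total}"

if __name__ == "__main__" and len(sys.argv) > 1:
    # Hadoop would invoke this script twice, once per phase, e.g.:
    #   hadoop jar hadoop-streaming.jar -mapper "count.py map" \
    #       -reducer "count.py reduce" -input users -output totals
    stage = mapper if sys.argv[1] == "map" else reducer
    for out in stage(sys.stdin):
        print(out)
```

Chaining the two functions locally (`reducer(mapper(lines))`) is a handy way to test the logic before submitting a real job.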

Which is easiest?

Well, that all depends on what you feel comfortable with. If you know Java, take a look at the standard Java framework. If you are used to scripting languages, you could use Pig or streaming. If you know SQL, you could look at using HiveQL to query data on HDFS. I would look at the documentation for each as a starting point.

Binary Nerd
Ok, thanks, I'll take a look at these
Zubair
Hive and Pig look promising!
Zubair