A simple wordcount reducer in Ruby looks like this:

#!/usr/bin/env ruby
wordcount = Hash.new

STDIN.each_line do |line|
  keyval = line.split("|")
  wordcount[keyval[0]] = wordcount[keyval[0]].to_i + keyval[1].to_i
end

wordcount.each_pair do |word, count|
  puts "#{word}|#{count}"
end

It receives on STDIN the intermediate values from all mappers, not just those for a specific key. So in effect there is only ONE reducer for everything (not one reducer per word or per set of words).

However, in the Java examples I saw this interface that gets a key and a list of values as input, which means intermediate map values are grouped by key before being reduced, and reducers can run in parallel:

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Is this a Java only feature? Or can I do it with Hadoop Streaming using Ruby?

+1  A: 

I haven't tried Hadoop Streaming myself, but from reading the docs I think you can achieve similar parallel behavior.

Instead of passing each reducer a key together with its list of associated values, streaming groups the mapper output by key. It also guarantees that values with the same key won't be split over multiple reducers. This is somewhat different from the normal Hadoop (Java) behavior, but even so, the reduce work will be distributed over multiple reducers.

Try the -verbose option to get more information about what's really going on. You can also experiment with the -D mapred.reduce.tasks=X option, where X is the desired number of reducers.
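For illustration, a streaming job for a Ruby word count might be launched roughly like this; the jar location, HDFS paths, and script names below are placeholders for your own setup, and the -D generic option has to come before the streaming-specific options:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=4 \
    -input /user/me/wordcount/input \
    -output /user/me/wordcount/output \
    -mapper mapper.rb \
    -reducer reducer.rb \
    -file mapper.rb \
    -file reducer.rb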

sris
+3  A: 

Reducers will always run in parallel, whether you're using streaming or not (if you're not seeing this, verify that the job configuration is set to allow multiple reduce tasks -- see mapred.reduce.tasks in your cluster or job configuration). The difference is that the framework packages things up a little more nicely for you when you use Java versus streaming.

For Java, the reduce task gets an iterator over all the values for a particular key. This makes it easy to walk the values if you are, say, summing the map output in your reduce task. In streaming, you literally just get a stream of key-value pairs. You are guaranteed that the values will be ordered by key, and that the values for a given key will not be split across reduce tasks, but any state tracking you need is up to you. For example, in Java your map output comes to your reducer symbolically in the form

key1, {val1, val2, val3}
key2, {val7, val8}

With streaming, your output instead looks like

key1, val1
key1, val2
key1, val3
key2, val7
key2, val8

For example, to write a reducer that computes the sum of the values for each key, you'll need a variable to store the last key you saw and a variable to store the running sum. Each time you read a new key-value pair, you do the following (see the Ruby sketch after this list):

  1. Check whether the key is different from the last key you saw.
  2. If so, output the last key with its accumulated sum, and reset the sum to zero.
  3. Add the current value to the sum and set the last key to the current key.
  4. When the input is exhausted, output the final key and its sum.
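
As a rough sketch (not tested on a cluster), a streaming reducer in Ruby following these steps might look like this. It assumes the mapper emits "word|count" lines with the "|" separator used in the question; with streaming's default tab separator you would split on "\t" instead:

#!/usr/bin/env ruby
# Streaming reducer: lines arrive sorted by key, one "word|count" per line.
last_key = nil
sum = 0

STDIN.each_line do |line|
  key, value = line.chomp.split("|")
  if last_key && key != last_key
    # Key changed: emit the finished key and start a new sum.
    puts "#{last_key}|#{sum}"
    sum = 0
  end
  last_key = key
  sum += value.to_i
end

# Emit the final key once the input is exhausted.
puts "#{last_key}|#{sum}" if last_key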

HTH.

Kevin Weil