ansaurus

Question

Answer 1

+1 A:

You can get some control over which keys get sent to which reducers by implementing the Partitioner interface.

From the Hadoop API docs:

Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent for reduction.
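
As a rough illustration of the hash-based case the docs describe, here is a plain-Java sketch of the formula used by Hadoop's default HashPartitioner. It has no Hadoop dependency, and the names `HashPartitionSketch` and `partitionFor` are made up for this example. In a real job the same expression would sit inside the `getPartition` method of your Partitioner implementation.

```java
// Plain-Java sketch of hash partitioning; class and method names are hypothetical.
class HashPartitionSketch {
    // Maps a key to a partition index in [0, numReduceTasks).
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"apple", "banana", "cherry"};
        for (String key : keys) {
            System.out.println(key + " -> partition " + partitionFor(key, 4)
                    + " of 4, partition " + partitionFor(key, 1) + " of 1");
        }
    }
}
```

Note that with a single reduce task every key lands in partition 0, whatever its hash.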

The following book does a great job of describing partitioning, key sorting strategies, and tradeoffs, along with other issues in MapReduce algorithm design: http://www.umiacs.umd.edu/~jimmylin/book.html

Alex Blakemore 2010-09-19 05:03:33

If you have a single-node "cluster" running a single reduce task, then all keys will be sent to partition 0, so using the partitioner alone won't do the trick. See my own answer for details on why.

Niels Basjes 2010-09-19 08:18:35

Answer 2

A:

My guess is the same as above: if possible, sort the keys and assign them to reducers based on your partitioning criteria. See the UCB 61A MapReduce lecture (lecture 34) on YouTube, which covers this.

ram 2010-09-19 05:47:13

Answer 3

+2 A:

This question is a bit unclear to me, but I think I have a pretty good idea of what you want.

First of all, if you do nothing special, every time reduce is called it gets exactly one key with a set of one or more values (via an iterator).

My guess is that you want to ensure that every reducer gets exactly one 'key-value pair'. There are essentially two ways of doing that:

1. Ensure in the mapper that all output keys are unique, so that each key has only one value.
2. Force this on the reduce side by setting a group comparator that simply classifies all keys as different.

So, if I understand your question correctly, you should implement a GroupComparator that simply states that all keys are different and should therefore be sent to a different reducer call.
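
To make the idea concrete, here is a small plain-Java simulation of how reduce calls are formed from sorted map output. It is a sketch and not Hadoop code: `GroupingSketch`, `group`, `NATURAL` and `ALWAYS_DIFFERENT` are hypothetical names, and it assumes that grouping is decided by comparing each key with the one before it. In a real job the "always different" logic would go into the comparator class passed to `setOutputValueGroupingComparator`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Simulates the grouping of sorted (key, value) pairs into reduce calls.
class GroupingSketch {
    // Starts a new group whenever the grouping comparator reports that the
    // current key differs from the previous one (assumed adjacent-key comparison).
    static List<List<String>> group(List<String[]> sortedPairs, Comparator<String> grouping) {
        List<List<String>> reduceCalls = new ArrayList<>();
        String previousKey = null;
        for (String[] pair : sortedPairs) {
            if (previousKey == null || grouping.compare(previousKey, pair[0]) != 0) {
                reduceCalls.add(new ArrayList<>());
            }
            reduceCalls.get(reduceCalls.size() - 1).add(pair[0] + "=" + pair[1]);
            previousKey = pair[0];
        }
        return reduceCalls;
    }

    // Default behaviour: equal keys compare as 0 and share one reduce call.
    static final Comparator<String> NATURAL = Comparator.naturalOrder();

    // "All keys are different": never returns 0, so every pair gets its own call.
    static final Comparator<String> ALWAYS_DIFFERENT = (a, b) -> 1;

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[] {"a", "1"}, new String[] {"a", "2"}, new String[] {"b", "3"});
        System.out.println(group(pairs, NATURAL));          // [[a=1, a=2], [b=3]]
        System.out.println(group(pairs, ALWAYS_DIFFERENT)); // [[a=1], [a=2], [b=3]]
    }
}
```

A comparator that never returns 0 breaks the usual Comparator contract, so it is only suitable for grouping and should not be reused for sorting.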

Because of other answers in this question I'm adding a bit more detail:

There are 3 methods used for comparing keys (I pulled these code samples from a project I did using the 0.18.3 API):

Partitioner

    conf.setPartitionerClass(KeyPartitioner.class);

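The partitioner only ensures that "things that must be the same end up on the same partition". The number of partitions equals the number of reduce tasks, so if the job runs with a single reduce task (as is typical on one computer) there is only one partition and this won't help much.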
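
A partitioner of this kind often looks at only part of the key. The sketch below is plain Java, not the KeyPartitioner from my project: it assumes a hypothetical composite key of the form `natural#secondary` and partitions on the part before the `#`, so that all keys sharing that part end up in the same partition.

```java
// Sketch of partitioning on a subset of the key; the "natural#secondary"
// key layout and all names here are assumptions made for illustration.
class PrefixPartitionSketch {
    // Returns the part of the key before the first '#', or the whole key if there is none.
    static String naturalPart(String compositeKey) {
        int sep = compositeKey.indexOf('#');
        return sep < 0 ? compositeKey : compositeKey.substring(0, sep);
    }

    // Hashes only the natural part, so "bob#1" and "bob#2" share a partition.
    static int partitionFor(String compositeKey, int numReduceTasks) {
        return (naturalPart(compositeKey).hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"bob#1", "bob#2", "alice#1"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 3));
        }
    }
}
```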

Key Comparator

    conf.setOutputKeyComparatorClass(KeyComparator.class);

The key comparator is used to sort the "key-value pairs" within a group by looking at the key, which must therefore differ in some way between the pairs.

Group Comparator

    conf.setOutputValueGroupingComparator(GroupComparator.class);

The group comparator is used to group keys that are different, yet must be sent to the same reduce call.
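
The interplay between the key comparator and the group comparator can be sketched in plain Java as follows. This is a simulation and not the KeyComparator or GroupComparator from my project: it assumes hypothetical string keys of the form `natural#secondary`, sorts on the full key, and groups on the part before the `#`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Simulates sorting with a key comparator and grouping with a group comparator.
class SecondarySortSketch {
    // Returns the part of the key before the first '#', or the whole key if there is none.
    static String naturalPart(String compositeKey) {
        int sep = compositeKey.indexOf('#');
        return sep < 0 ? compositeKey : compositeKey.substring(0, sep);
    }

    // Key comparator: orders by the full composite string.
    static final Comparator<String> KEY_COMPARATOR = Comparator.naturalOrder();

    // Group comparator: looks only at the natural part, so keys that differ
    // in their secondary part still end up in the same reduce call.
    static final Comparator<String> GROUP_COMPARATOR =
            Comparator.comparing(SecondarySortSketch::naturalPart);

    // Sorts the keys, then starts a new reduce call whenever the group changes.
    static List<List<String>> reduceCalls(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(KEY_COMPARATOR);
        List<List<String>> calls = new ArrayList<>();
        String previous = null;
        for (String key : sorted) {
            if (previous == null || GROUP_COMPARATOR.compare(previous, key) != 0) {
                calls.add(new ArrayList<>());
            }
            calls.get(calls.size() - 1).add(key);
            previous = key;
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(reduceCalls(Arrays.asList("bob#2", "alice#3", "bob#1", "alice#1")));
        // prints [[alice#1, alice#3], [bob#1, bob#2]]
    }
}
```

Each inner list stands for one reduce call: the keys inside it differ, but they arrive together and in sorted order.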

HTH

Niels Basjes 2010-09-19 08:17:22


hadoop + one key to every reducer
