I'm curious: how do MapReduce, Hadoop, etc. break a chunk of data into independently operated tasks? I'm having a hard time imagining how that can work, considering how common it is for data to be quite interrelated, with state conditions between tasks, etc.

Thanks.

A: 

If the data IS related, it is your job to ensure that this information is passed along. MapReduce breaks up the data and processes it regardless of any relations that are not explicitly encoded:

Map just reads data in blocks from the input files and passes it to the map function one "record" at a time. The default record is a line, but this can be changed, since the record boundaries come from the InputFormat.
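To make that contract concrete, here is a minimal sketch of a Hadoop mapper (new org.apache.hadoop.mapreduce API) that just passes records through. The class name PassThroughMapper is mine, not part of Hadoop; with the default TextInputFormat each call receives one line, keyed by its byte offset in the file:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PassThroughMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Each call sees exactly one record, independent of all others.
            context.write(offset, line);
        }
    }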

You can annotate the data in Map with its origin, but what Map basically lets you do is categorize the data: you emit a new key and new values, and MapReduce groups by the new key. So if there are relations between different records, emit them under the same (or a similar *1) key so that they are grouped together.
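For example, a sketch of choosing the key so that related records meet: assume (hypothetically) CSV records whose first field is a customer ID. Emitting that field as the key routes all records of one customer to the same reduce call; the class name and record layout are illustrative:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class GroupByCustomerMapper
            extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Hypothetical record layout: "customerId,rest-of-record"
            String[] fields = line.toString().split(",", 2);
            // The emitted key decides the grouping: same customer id,
            // same group on the reduce side.
            context.write(new Text(fields[0]),
                          new Text(fields.length > 1 ? fields[1] : ""));
        }
    }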

For Reduce, the data is partitioned and sorted (that is where the grouping takes place), and afterwards the reduce function receives all the data from one group: one key and all its associated values. Now you can aggregate over the values. That's it.
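A matching sketch of the reduce side, assuming the mappers emitted IntWritable counts: one key, all its values, and an aggregation (here a plain sum) over them:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            // All values grouped under this key arrive in this single call.
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }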

So what you get from MapReduce is an overall group-by; everything else is your responsibility. You want a cross product of two sources? Implement it, for example, by introducing artificial keys and multi-emitting (a fragment-and-replicate join). Your imagination is the limit. And you can always pass the data through another job.
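To illustrate the multi-emitting idea, here is a hedged sketch of the replicate side of such a cross product: every record of one source is emitted once per fragment under an artificial integer key. NUM_FRAGMENTS and the class name are illustrative; the other source's mapper would emit each record exactly once, keyed by something like hash(record) % NUM_FRAGMENTS, so each reducer pairs one fragment with the full replicated source:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ReplicateSideMapper
            extends Mapper<LongWritable, Text, IntWritable, Text> {

        // Illustrative constant: number of fragments of the other source.
        private static final int NUM_FRAGMENTS = 8;

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            // Multi-emit under artificial keys: each group then holds one
            // fragment of the other source plus all records emitted here,
            // so the reducers jointly cover the full cross product.
            for (int fragment = 0; fragment < NUM_FRAGMENTS; fragment++) {
                context.write(new IntWritable(fragment), record);
            }
        }
    }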

*1: similar, because you can influence the grouping later on. Normally grouping is by the identity of the key, but you can change this.
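In Hadoop, one way to change it is a custom grouping comparator, set on the job via Job.setGroupingComparatorClass. A sketch (class name mine) that treats two keys as the same group whenever their first comma-separated field matches:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class FirstFieldGroupingComparator extends WritableComparator {

        public FirstFieldGroupingComparator() {
            super(Text.class, true);  // true: instantiate keys for compare()
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            // Group by the first comma-separated field only, instead of
            // the identity of the whole key.
            String fieldA = a.toString().split(",", 2)[0];
            String fieldB = b.toString().split(",", 2)[0];
            return fieldA.compareTo(fieldB);
        }
    }

Wired up with job.setGroupingComparatorClass(FirstFieldGroupingComparator.class) before submitting the job.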

Leonidas