views: 41
answers: 1
I have run into a complex problem with MapReduce. I am trying to match up two unique values that are not always present together on the same line. Once I map those out, I need to count the total number of unique events for that mapping.

The log files I am crunching are 100GB+ uncompressed and have their data broken into two parts that I need to bring together. The events are spread across many different log files. I think the easiest way to describe the problem is to show a sample of the logs.

[2010/09/23 12:02am]   AAAAAAAAAA   BBBBBBBBBB   Event message type A
[2010/09/23 12:02am]                BBBBBBBBBB   Event message type B
[2010/09/23 12:03am]                BBBBBBBBBB   Event message type B
[2010/09/23 12:09am]                BBBBBBBBBB

[2010/09/23 12:01am]   CCCCCCCCCC   DDDDDDDDDD   Event message type A
[2010/09/23 12:05am]                DDDDDDDDDD   Event message type A
[2010/09/23 12:06am]                DDDDDDDDDD   Event message type C

The 2nd and 3rd columns are unique IDs that never match each other. I need to map out the number of unique items in the 4th column linked to each pair of 2nd and 3rd columns. The 2nd column is always present at least once per pair. The 3rd column is always present. The 4th column may or may not be present; when it is missing, I still want to count it as an unknown event. The actual number of unique values reaches into the high millions, and the total number of log lines runs into the billions.

The solution for the above should be:

AAAAAAAAAA,BBBBBBBBBB,A    1
AAAAAAAAAA,BBBBBBBBBB,B    2
AAAAAAAAAA,BBBBBBBBBB,Unknown    1

CCCCCCCCCC,DDDDDDDDDD,A    2
CCCCCCCCCC,DDDDDDDDDD,C    1
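On a sample small enough to fit in memory, the counting I want can be expressed in plain Python (a sketch assuming lines are already parsed into (col2, col3, col4) tuples, with None where a value is missing; names are illustrative):

```python
from collections import Counter

def count_events(records):
    # Map each 3rd-column ID to the 2nd-column ID seen alongside it
    # (the 2nd column appears at least once per 3rd-column ID).
    col2_for_col3 = {c3: c2 for c2, c3, _event in records if c2}
    # Count (col2, col3, event) triples; a missing event counts
    # as "Unknown".
    return Counter(
        (col2_for_col3[c3], c3, event or "Unknown")
        for _c2, c3, event in records
    )
```

The hard part is doing exactly this join-then-count at 100GB+ scale with MapReduce, which is what I'm stuck on.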

I had thought about breaking down the 2nd and 3rd columns in two separate MapReduce jobs, but bringing those results back together is hairy. I am not sure how to do the final MapReduce to combine these values. The 2nd column will be all over the place in the file; it could show up at 1am and then again at 11pm.

Any suggestions on how I could use Hadoop MapReduce to solve this problem? I am using Hadoop streaming and don't know Java.

Thanks in advance for the help :D

+1  A: 

My suggestion to you is to do it as follows:

  1. Ensure that all records contain all values.
  2. Aggregate (i.e. count).

So you start with (slight variation on what you showed):

[2010/09/23 12:01am]                BBBBBBBBBB   Event message type B
[2010/09/23 12:02am]   AAAAAAAAAA   BBBBBBBBBB   Event message type A
[2010/09/23 12:03am]                BBBBBBBBBB   Event message type B
[2010/09/23 12:09am]                BBBBBBBBBB   

[2010/09/23 12:01am]                DDDDDDDDDD   Event message type A
[2010/09/23 12:05am]   CCCCCCCCCC   DDDDDDDDDD   Event message type A
[2010/09/23 12:06am]                DDDDDDDDDD   Event message type C

Step 1 would use "BBBBBBBBBB" as the key and do a secondary sort (see the Hadoop examples and the explanation in Tom White's book) to ensure that the record containing "AAAAAAAAAA" is the first to arrive at the reducer. In the reducer you give all records in the group the same 2nd-column value (the "AAAAAAAAAA") as the first record. You do no aggregation here; you simply make the records complete. That also means filling in "Unknown" where there was no event.
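Since you are on Hadoop streaming, a minimal sketch of this first step could look like the following (assuming the log fields are tab-separated, with an empty field where the 2nd column is absent; all names are illustrative):

```python
def step1_mapper(lines):
    """Emit composite-key records: id_b <tab> flag <tab> id_a <tab> event."""
    for line in lines:
        # Pad so lines with no event field still yield four values.
        fields = (line.rstrip("\n").split("\t") + [""] * 3)[:4]
        _ts, id_a, id_b, event = fields
        # Sort flag: "0" (id_a present) sorts before "1", so after the
        # secondary sort the record carrying id_a is the first one the
        # reducer sees for each id_b group.
        flag = "0" if id_a else "1"
        yield "\t".join([id_b, flag, id_a, event])

def step1_reducer(sorted_lines):
    """Copy the group's id_a onto every record; fill missing events."""
    current_b = current_a = None
    for line in sorted_lines:
        id_b, _flag, id_a, event = line.split("\t")
        if id_b != current_b:
            # First record of the group carries id_a thanks to the sort.
            current_b, current_a = id_b, id_a
        yield "\t".join([current_a, id_b, event or "Unknown"])
```

In a real streaming job each function would read sys.stdin and print its output, and the shuffle replaces the sorted() call used for local testing. To get the secondary sort you would partition on the first key field only, e.g. with -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner, -D stream.num.map.output.key.fields=2 and -D mapred.text.key.partitioner.options=-k1,1 (option names vary between Hadoop versions).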

So after that first step you have something like this:

   AAAAAAAAAA   BBBBBBBBBB   Event message type B
   AAAAAAAAAA   BBBBBBBBBB   Event message type A
   AAAAAAAAAA   BBBBBBBBBB   Event message type B
   AAAAAAAAAA   BBBBBBBBBB   Unknown

   CCCCCCCCCC   DDDDDDDDDD   Event message type A
   CCCCCCCCCC   DDDDDDDDDD   Event message type A
   CCCCCCCCCC   DDDDDDDDDD   Event message type C

Then in the second step you essentially do the same as the well-known "Wordcount" example, with the entire "AAAAAAAAAA   BBBBBBBBBB   Event message type B" record as your "word".

Giving you the desired output:

   AAAAAAAAAA   BBBBBBBBBB   Event message type B     2
   AAAAAAAAAA   BBBBBBBBBB   Event message type A     1
   AAAAAAAAAA   BBBBBBBBBB   Unknown                  1

   CCCCCCCCCC   DDDDDDDDDD   Event message type A     2
   CCCCCCCCCC   DDDDDDDDDD   Event message type C     1
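That second step can be sketched the same way (a wordcount over complete records, assuming the tab-separated output of the first step; names are illustrative):

```python
def step2_mapper(lines):
    # Each complete record is the "word"; emit it with a count of 1.
    for line in lines:
        yield line.rstrip("\n") + "\t1"

def step2_reducer(sorted_lines):
    # After the shuffle sort, identical records arrive adjacent, so a
    # running total per record is enough.
    current, total = None, 0
    for line in sorted_lines:
        record, count = line.rsplit("\t", 1)
        if record != current:
            if current is not None:
                yield current + "\t" + str(total)
            current, total = record, 0
        total += int(count)
    if current is not None:
        yield current + "\t" + str(total)
```

A combiner with the same logic as the reducer would cut shuffle traffic a lot here, since with billions of lines most of the volume is duplicate records.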

HTH

Niels Basjes