views: 41
answers: 1
I have run into a complex problem with MapReduce. I am trying to match up two unique values that are not always present together on the same line. Once I map those out, I need to count the total number of unique events for that mapping.

The log files I am crunching are 100GB+ uncompressed and have their data broken into two parts that I need to bring together. The events are spread across many different log files. I think the easiest way to describe the problem is to show a sample of the logs.

[2010/09/23 12:02am]   AAAAAAAAAA   BBBBBBBBBB   Event message type A
[2010/09/23 12:02am]                BBBBBBBBBB   Event message type B
[2010/09/23 12:03am]                BBBBBBBBBB   Event message type B
[2010/09/23 12:09am]                BBBBBBBBBB

[2010/09/23 12:01am]   CCCCCCCCCC   DDDDDDDDDD   Event message type A
[2010/09/23 12:05am]                DDDDDDDDDD   Event message type A
[2010/09/23 12:06am]                DDDDDDDDDD   Event message type C

The 2nd and 3rd columns are unique IDs that never match each other. I need to map out the number of unique items in the 4th column linked to each pair of 2nd and 3rd columns. The 2nd column is always present at least once per pair. The 3rd column is always present. The 4th column may or may not be present; when it is missing, I still want to count it as an unknown event. The actual number of unique values reaches into the high millions, and the total number of log lines runs into the billions.

The solution for the above should be:

AAAAAAAAAA,BBBBBBBBBB,A    1
AAAAAAAAAA,BBBBBBBBBB,B    2
AAAAAAAAAA,BBBBBBBBBB,Unknown    1

CCCCCCCCCC,DDDDDDDDDD,A    2
CCCCCCCCCC,DDDDDDDDDD,C    1
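On a sample small enough to fit in memory, the counting I want can be expressed in plain Python (a sketch assuming lines are already parsed into (col2, col3, col4) tuples, with None where a value is missing; names are illustrative):

```python
from collections import Counter

def count_events(records):
    # Map each 3rd-column ID to the 2nd-column ID seen alongside it
    # (the 2nd column appears at least once per 3rd-column ID).
    col2_for_col3 = {c3: c2 for c2, c3, _event in records if c2}
    # Count (col2, col3, event) triples; a missing event counts
    # as "Unknown".
    return Counter(
        (col2_for_col3[c3], c3, event or "Unknown")
        for _c2, c3, event in records
    )
```

The hard part is doing exactly this join-then-count at 100GB+ scale with MapReduce, which is what I'm stuck on.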

I had thought about breaking down the 2nd and 3rd columns in two separate MapReduce jobs, but bringing those results back together is hairy. I am not sure how to do the final MapReduce to combine these values. The 2nd column will be all over the place in the file; it could show up at 1am and then again at 11pm.

Any suggestions on how I could use Hadoop MapReduce to solve this problem? I am using Hadoop streaming and don't know Java.

Thanks in advance for the help :D

+1  A: 

My suggestion to you is to do it as follows:

  1. Ensure that all records contain all values.
  2. Aggregate (i.e. count).

So you start with (slight variation on what you showed):

[2010/09/23 12:01am]                BBBBBBBBBB   Event message type B
[2010/09/23 12:02am]   AAAAAAAAAA   BBBBBBBBBB   Event message type A
[2010/09/23 12:03am]                BBBBBBBBBB   Event message type B
[2010/09/23 12:09am]                BBBBBBBBBB   

[2010/09/23 12:01am]                DDDDDDDDDD   Event message type A
[2010/09/23 12:05am]   CCCCCCCCCC   DDDDDDDDDD   Event message type A
[2010/09/23 12:06am]                DDDDDDDDDD   Event message type C

Step 1 would use "BBBBBBBBBB" as the key and do a secondary sort (see the Hadoop examples and the explanation in Tom White's book) to ensure that the record containing "AAAAAAAAAA" is the first to arrive at the reducer. In the reducer you give all records in the group the same 2nd-column value (the "AAAAAAAAAA") as the first record. You do no aggregation here; you simply make the records complete. That also means filling in "Unknown" where there was no event.
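Since you are on Hadoop streaming, a minimal sketch of this first step could look like the following (assuming the log fields are tab-separated, with an empty field where the 2nd column is absent; all names are illustrative):

```python
def step1_mapper(lines):
    """Emit composite-key records: id_b <tab> flag <tab> id_a <tab> event."""
    for line in lines:
        # Pad so lines with no event field still yield four values.
        fields = (line.rstrip("\n").split("\t") + [""] * 3)[:4]
        _ts, id_a, id_b, event = fields
        # Sort flag: "0" (id_a present) sorts before "1", so after the
        # secondary sort the record carrying id_a is the first one the
        # reducer sees for each id_b group.
        flag = "0" if id_a else "1"
        yield "\t".join([id_b, flag, id_a, event])

def step1_reducer(sorted_lines):
    """Copy the group's id_a onto every record; fill missing events."""
    current_b = current_a = None
    for line in sorted_lines:
        id_b, _flag, id_a, event = line.split("\t")
        if id_b != current_b:
            # First record of the group carries id_a thanks to the sort.
            current_b, current_a = id_b, id_a
        yield "\t".join([current_a, id_b, event or "Unknown"])
```

In a real streaming job each function would read sys.stdin and print its output, and the shuffle replaces the sorted() call used for local testing. To get the secondary sort you would partition on the first key field only, e.g. with -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner, -D stream.num.map.output.key.fields=2 and -D mapred.text.key.partitioner.options=-k1,1 (option names vary between Hadoop versions).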

So after that first step you have something like this:

   AAAAAAAAAA   BBBBBBBBBB   Event message type B
   AAAAAAAAAA   BBBBBBBBBB   Event message type A
   AAAAAAAAAA   BBBBBBBBBB   Event message type B
   AAAAAAAAAA   BBBBBBBBBB   Unknown

   CCCCCCCCCC   DDDDDDDDDD   Event message type A
   CCCCCCCCCC   DDDDDDDDDD   Event message type A
   CCCCCCCCCC   DDDDDDDDDD   Event message type C

Then in the second step you essentially do the same as the well-known "Wordcount" example, with the entire "AAAAAAAAAA   BBBBBBBBBB   Event message type B" record as your "word".

Giving you the desired output:

   AAAAAAAAAA   BBBBBBBBBB   Event message type B     2
   AAAAAAAAAA   BBBBBBBBBB   Event message type A     1
   AAAAAAAAAA   BBBBBBBBBB   Unknown                  1

   CCCCCCCCCC   DDDDDDDDDD   Event message type A     2
   CCCCCCCCCC   DDDDDDDDDD   Event message type C     1
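That second step can be sketched the same way (a wordcount over complete records, assuming the tab-separated output of the first step; names are illustrative):

```python
def step2_mapper(lines):
    # Each complete record is the "word"; emit it with a count of 1.
    for line in lines:
        yield line.rstrip("\n") + "\t1"

def step2_reducer(sorted_lines):
    # After the shuffle sort, identical records arrive adjacent, so a
    # running total per record is enough.
    current, total = None, 0
    for line in sorted_lines:
        record, count = line.rsplit("\t", 1)
        if record != current:
            if current is not None:
                yield current + "\t" + str(total)
            current, total = record, 0
        total += int(count)
    if current is not None:
        yield current + "\t" + str(total)
```

A combiner with the same logic as the reducer would cut shuffle traffic a lot here, since with billions of lines most of the volume is duplicate records.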

HTH

Niels Basjes