I have run into a complex problem with MapReduce. I am trying to match up two unique IDs that are not always present together on the same line. Once I have mapped them to each other, I need to count the number of events of each type for that pairing.
The log files I am crunching are 100GB+ uncompressed, and the data I need is broken into two parts that I have to bring back together. The events are spread across many different log files. I think the easiest way to describe the problem is to show a sample of the logs:
    [2010/09/23 12:02am] AAAAAAAAAA BBBBBBBBBB Event message type A
    [2010/09/23 12:02am] BBBBBBBBBB Event message type B
    [2010/09/23 12:03am] BBBBBBBBBB Event message type B
    [2010/09/23 12:09am] BBBBBBBBBB
    [2010/09/23 12:01am] CCCCCCCCCC DDDDDDDDDD Event message type A
    [2010/09/23 12:05am] DDDDDDDDDD Event message type A
    [2010/09/23 12:06am] DDDDDDDDDD Event message type C
The 2nd and 3rd columns are unique IDs that never match each other. The 2nd column appears at least once for every 3rd-column ID, but not on every line; the 3rd column is always present. The 4th column may or may not be present, and when it is missing I still want to count that line as an "Unknown" event. What I need is, for each 2nd/3rd-column pairing, a count of how many times each 4th-column event type occurs. The number of unique IDs reaches into the high millions, and the total number of log lines is in the billions.
The expected output for the sample above would be:
    AAAAAAAAAA,BBBBBBBBBB,A 1
    AAAAAAAAAA,BBBBBBBBBB,B 2
    AAAAAAAAAA,BBBBBBBBBB,Unknown 1
    CCCCCCCCCC,DDDDDDDDDD,A 2
    CCCCCCCCCC,DDDDDDDDDD,C 1
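To show where I am on the parsing side, here is a rough Python sketch of the streaming mapper I have been playing with. It keys every record on the 3rd-column ID and tags it as either an ID pairing or an event. The looks_like_id() check is just a stand-in assumption; my real IDs have a fixed width and character set, so that part would need to match whatever the actual format is.

    #!/usr/bin/env python
    # mapper.py -- rough sketch, not battle-tested
    import sys

    def looks_like_id(token):
        # Stand-in check: assumes IDs are fixed-width alphanumeric strings.
        return len(token) == 10 and token.isalnum()

    for line in sys.stdin:
        parts = line.split()
        if len(parts) < 3:
            continue
        # parts[0] and parts[1] are the bracketed timestamp, e.g. "[2010/09/23" "12:02am]"
        rest = parts[2:]
        if len(rest) >= 2 and looks_like_id(rest[0]) and looks_like_id(rest[1]):
            # This line carries both IDs: remember the pairing, keyed on the 3rd column.
            a_id, b_id, event = rest[0], rest[1], rest[2:]
        else:
            a_id, b_id, event = None, rest[0], rest[1:]
        # "Event message type A" -> "A"; no event text at all -> "Unknown"
        event_type = event[-1] if event else "Unknown"
        if a_id is not None:
            print("%s\tID\t%s" % (b_id, a_id))
        print("%s\tEV\t%s" % (b_id, event_type))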
I had thought about breaking the 2nd and 3rd columns out into two separate MapReduce jobs, but bringing those results back together is hairy, and I am not sure how to write the final MapReduce that combines them. The 2nd column will be all over the place in the files; it could show up at 1am and then again at 11pm.
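That said, since the 3rd-column ID is on every line, I have been wondering whether a single job might be enough: if the mapper above keys everything on that ID, the shuffle should deliver all records for a given ID to the same reducer, which can pick up the 2nd-column ID when it arrives and count the event types in one pass. This is the kind of reducer I have in mind (it assumes streaming's default of treating everything up to the first tab as the key), but I don't know whether it holds up at billions of lines:

    #!/usr/bin/env python
    # reducer.py -- sketch of a single-pass join-and-count
    import sys
    from collections import defaultdict

    def flush(b_id, a_id, counts):
        # One output line per (2nd col, 3rd col, event type) with its count.
        for event_type, n in sorted(counts.items()):
            print("%s,%s,%s\t%d" % (a_id, b_id, event_type, n))

    current_b = None
    current_a = "Unknown"          # safety net; the pairing should always show up
    counts = defaultdict(int)

    for line in sys.stdin:
        b_id, tag, value = line.rstrip("\n").split("\t", 2)
        if b_id != current_b:
            if current_b is not None:
                flush(current_b, current_a, counts)
            current_b, current_a, counts = b_id, "Unknown", defaultdict(int)
        if tag == "ID":
            current_a = value      # the 2nd-column ID paired with this 3rd-column ID
        else:                      # tag == "EV"
            counts[value] += 1

    if current_b is not None:
        flush(current_b, current_a, counts)

The per-key state is only one ID plus a handful of event-type counts, so memory should stay small even with millions of IDs, but I would appreciate a sanity check on whether this single-job approach is sound at this scale or whether I really do need the separate jobs and a join.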
Any suggestions on how I could use Hadoop MapReduce to solve this problem? I am using Hadoop Streaming and don't know Java.
Thanks in advance for the help :D