I am playing around with Hadoop and have set up a two-node cluster on Ubuntu. The WordCount example runs just fine.
Now I'd like to write my own MapReduce program to analyze some log data (main reason: it looks simple, and I have plenty of data).
Each line in the log has this format:
<UUID> <Event> <Timestamp>
where the event can be INIT, START, STOP, ERROR, and some others. What I am most interested in is the elapsed time between the START and STOP events for the same UUID.
For example, my log contains entries like these:
35FAA840-1299-11DF-8A39-0800200C9A66 START 1265403584
[...many other lines...]
35FAA840-1299-11DF-8A39-0800200C9A66 STOP 1265403777
My current, linear program reads through the files, remembers the START events in memory, and writes the elapsed time to a file once it finds the corresponding end event (lines with other events are currently ignored; ERROR events invalidate a UUID, and it will be ignored, too).¹
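For reference, this is essentially what the linear version does (a simplified sketch; the class name and the detail of printing to stdout are made up for illustration):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LinearElapsed {
    public static void main(String[] args) throws Exception {
        Map<String, Long> startTimes = new HashMap<String, Long>();
        Set<String> invalid = new HashSet<String>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split(" ");
            if (fields.length != 3) continue;
            String uuid = fields[0];
            String event = fields[1];
            long ts = Long.parseLong(fields[2]);
            if (event.equals("ERROR")) {
                invalid.add(uuid);            // ERROR invalidates the UUID
                startTimes.remove(uuid);
            } else if (event.equals("START") && !invalid.contains(uuid)) {
                startTimes.put(uuid, ts);     // remember the start event in memory
            } else if (event.equals("STOP") && startTimes.containsKey(uuid)) {
                // matching end event found: write out the elapsed time
                System.out.println(uuid + " " + (ts - startTimes.remove(uuid)));
            }
            // lines with all other events are ignored
        }
        in.close();
    }
}
```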
I would like to port this to a Hadoop/MapReduce program, but I am not sure how to do the matching of entries. Splitting/tokenizing the file is easy, and I guess the matching will happen in the Reduce class. But what would that look like? How do I find matching entries in a MapReduce job?
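The tokenizing part I can picture (a rough sketch against the `org.apache.hadoop.mapreduce` API; the class name and the idea of packing event and timestamp into one Text value are just my guess):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits the UUID as the key so that all events of one UUID end up in the
// same reduce call; the value carries the event type and the timestamp.
public class LogEventMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text uuid = new Text();
    private final Text eventAndTime = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(" ");
        if (fields.length == 3) {
            uuid.set(fields[0]);
            eventAndTime.set(fields[1] + " " + fields[2]); // e.g. "START 1265403584"
            context.write(uuid, eventAndTime);
        }
    }
}
```

With that, each reduce call should receive one UUID together with all of its event/timestamp values; it is the body of that reducer I can't figure out.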
Please keep in mind that my main focus is to understand Hadoop/MapReduce; links to Pig and other Apache projects are welcome, but I'd like to solve this one with pure Hadoop/MapReduce. Thank you.
¹ Since the log is taken from a running application, some START events might not yet have corresponding end events, and there will be end events without start events due to log file splitting.