views: 268

answers: 2
I am playing around with Hadoop and have set up a two-node cluster on Ubuntu. The WordCount example runs just fine.

Now I'd like to write my own MapReduce program to analyze some log data (main reason: it looks simple and I have plenty of data).

Each line in the log has this format:

<UUID> <Event> <Timestamp>

where the event can be INIT, START, STOP, ERROR, and some others. What I am most interested in is the elapsed time between the START and STOP events for the same UUID.

For example, my log contains entries like these:

35FAA840-1299-11DF-8A39-0800200C9A66 START 1265403584
[...many other lines...]
35FAA840-1299-11DF-8A39-0800200C9A66 STOP 1265403777

My current, linear program reads through the files, remembers the start events in memory, and writes the elapsed time to a file once it finds the corresponding end event (lines with other events are currently ignored; an ERROR event invalidates a UUID, which is then ignored as well)1
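Schematically, it does something like this (a simplified sketch, not the real code):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class SequentialElapsed {
  public static void main(String[] args) throws Exception {
    Map<String, Long> startTimes = new HashMap<String, Long>();
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      String[] parts = line.split(" ");        // <UUID> <Event> <Timestamp>
      String uuid = parts[0], event = parts[1];
      if (event.equals("START")) {
        startTimes.put(uuid, Long.parseLong(parts[2]));
      } else if (event.equals("ERROR")) {
        startTimes.remove(uuid);               // ERROR invalidates the pending START
      } else if (event.equals("STOP") && startTimes.containsKey(uuid)) {
        long elapsed = Long.parseLong(parts[2]) - startTimes.remove(uuid);
        System.out.println(uuid + " " + elapsed);
      }                                        // all other events are ignored
    }
    in.close();
  }
}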

I would like to port this to a Hadoop/MapReduce program. But I am not sure how to do the matching of entries. Splitting/tokenizing the file is easy, and I guess that finding the matches will happen in a Reduce class. But what would that look like? How do I find matching entries in a MapReduce job?

Please keep in mind that my main focus is to understand Hadoop/MapReduce; links to Pig and other Apache programs are welcome, but I'd like to solve this one with pure Hadoop/MapReduce. Thank you.

1) Since the log is taken from a running application, some start events might not yet have corresponding end events, and there will be end events without start events due to log file splitting.

+3  A: 

I think you could do this by making your map function output the UUID as its key and the rest of the line as its value. The reduce function will then be passed the collection of all log entries with the same UUID. As it processes them, it can keep track of the events it sees and act accordingly: when it sees a START event, it can store the time extracted from that line in a local variable, and when it sees a STOP event it can extract that line's time, subtract the start time, and output the difference (and handle the case where it sees the STOP before the START in the same way).
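A minimal sketch of that idea, assuming the new org.apache.hadoop.mapreduce API and the log format from the question (class names are invented, and the job setup/driver is omitted):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ElapsedTime {

  public static class LogMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // <UUID> <Event> <Timestamp>  ->  (UUID, "<Event> <Timestamp>")
      String[] parts = line.toString().split(" ");
      context.write(new Text(parts[0]), new Text(parts[1] + " " + parts[2]));
    }
  }

  public static class ElapsedReducer
      extends Reducer<Text, Text, Text, LongWritable> {
    @Override
    protected void reduce(Text uuid, Iterable<Text> events, Context context)
        throws IOException, InterruptedException {
      long start = -1, stop = -1;
      boolean error = false;
      for (Text value : events) {
        String[] parts = value.toString().split(" ");
        if (parts[0].equals("START")) start = Long.parseLong(parts[1]);
        else if (parts[0].equals("STOP")) stop = Long.parseLong(parts[1]);
        else if (parts[0].equals("ERROR")) error = true;  // invalidates this UUID
      }
      // emit only complete, valid START/STOP pairs
      if (!error && start >= 0 && stop >= 0) {
        context.write(uuid, new LongWritable(stop - start));
      }
    }
  }
}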

aem
+3  A: 

If you emit the UUID as the key in your map function, i.e. emit(<uuid>, <event, timestamp>), then your reduce function will receive all events for that UUID: key = UUID, values = {<event1, timestamp1>, <event2, timestamp2>, ...}

Then you can sort the events by timestamp and decide whether or not to emit them into the resulting file.
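For instance, a reduce function along these lines might buffer and sort the values itself (a sketch, assuming the map output value is the string "<event> <timestamp>"):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SortingReducer extends Reducer<Text, Text, Text, LongWritable> {
  @Override
  protected void reduce(Text uuid, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // buffer this UUID's events and sort them by timestamp
    List<String> events = new ArrayList<String>();
    for (Text value : values) {
      events.add(value.toString());            // "<event> <timestamp>"
    }
    Collections.sort(events, new Comparator<String>() {
      public int compare(String a, String b) {
        long ta = Long.parseLong(a.split(" ")[1]);
        long tb = Long.parseLong(b.split(" ")[1]);
        return ta < tb ? -1 : (ta > tb ? 1 : 0);
      }
    });
    // walk the sorted timeline and emit the elapsed time per START/STOP pair
    long start = -1;
    for (String event : events) {
      String[] parts = event.split(" ");
      if (parts[0].equals("START")) {
        start = Long.parseLong(parts[1]);
      } else if (parts[0].equals("STOP") && start >= 0) {
        context.write(uuid, new LongWritable(Long.parseLong(parts[1]) - start));
        start = -1;
      }
    }
  }
}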

Bonus: you can use job.setSortComparatorClass() to plug in your own sorting class, so your entries arrive in reduce already sorted by their timestamps (note that the sort comparator orders the map output keys, so for this the timestamp has to be part of the key, i.e. a secondary sort):

public static class BNLSortComparator extends Text.Comparator {
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    String sb1, sb2;
    try {
      sb1 = Text.decode(b1, s1, l1);  // decode the serialized Text keys
      sb2 = Text.decode(b2, s2, l2);
      return sb1.compareTo(sb2);      // compare them as strings
    } catch (java.nio.charset.CharacterCodingException e) {
      throw new RuntimeException(e);
    }
  }
}
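You would then register it on the job before submitting, e.g.:

job.setSortComparatorClass(BNLSortComparator.class);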
Leonidas
Of course, this makes sense. Instead of finding matches, I group them by key. This would also allow me to analyze the other events in the future. Thanks
phisch