I am playing around with Hadoop and have set up a two-node cluster on Ubuntu. The WordCount example runs just fine.
Now I'd like to write my own MapReduce program to analyze some log data (main reason: it looks simple, and I have plenty of data).
Each line in the log has this format:
<UUID> <Event> <Timestamp>
where the event can be INIT, START, STOP, ERROR, and some others. What I am most interested in is the elapsed time between the START and STOP events for the same UUID.
For example, my log contains entries like these:
35FAA840-1299-11DF-8A39-0800200C9A66 START 1265403584
[...many other lines...]
35FAA840-1299-11DF-8A39-0800200C9A66 STOP 1265403777
My current, linear program reads through the files, remembers the START events in memory, and writes the elapsed time to a file once it finds the corresponding end event (lines with other events are currently ignored; ERROR events invalidate a UUID, and it will be ignored, too).¹
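For reference, this is essentially what the linear version does (a simplified sketch; the class name and the detail of printing to stdout are made up for illustration):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LinearElapsed {
    public static void main(String[] args) throws Exception {
        Map<String, Long> startTimes = new HashMap<String, Long>();
        Set<String> invalid = new HashSet<String>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split(" ");
            if (fields.length != 3) continue;
            String uuid = fields[0];
            String event = fields[1];
            long ts = Long.parseLong(fields[2]);
            if (event.equals("ERROR")) {
                invalid.add(uuid);            // ERROR invalidates the UUID
                startTimes.remove(uuid);
            } else if (event.equals("START") && !invalid.contains(uuid)) {
                startTimes.put(uuid, ts);     // remember the start event in memory
            } else if (event.equals("STOP") && startTimes.containsKey(uuid)) {
                // matching end event found: write out the elapsed time
                System.out.println(uuid + " " + (ts - startTimes.remove(uuid)));
            }
            // lines with all other events are ignored
        }
        in.close();
    }
}
```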
I would like to port this to a Hadoop/MapReduce program, but I am not sure how to do the matching of entries. Splitting/tokenizing the file is easy, and I guess the matching will happen in the Reduce class. But what would that look like? How do I find matching entries in a MapReduce job?
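The tokenizing part I can picture (a rough sketch against the `org.apache.hadoop.mapreduce` API; the class name and the idea of packing event and timestamp into one Text value are just my guess):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits the UUID as the key so that all events of one UUID end up in the
// same reduce call; the value carries the event type and the timestamp.
public class LogEventMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text uuid = new Text();
    private final Text eventAndTime = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(" ");
        if (fields.length == 3) {
            uuid.set(fields[0]);
            eventAndTime.set(fields[1] + " " + fields[2]); // e.g. "START 1265403584"
            context.write(uuid, eventAndTime);
        }
    }
}
```

With that, each reduce call should receive one UUID together with all of its event/timestamp values; it is the body of that reducer I can't figure out.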
Please keep in mind that my main focus is to understand Hadoop/MapReduce; links to Pig and other Apache projects are welcome, but I'd like to solve this one with pure Hadoop/MapReduce. Thank you.
¹ Since the log is taken from a running application, some START events might not yet have corresponding end events, and there will be end events without start events due to log file splitting.