I have several log files of events (one event per line). The logs can possibly overlap. The logs are generated on separate client machines, possibly in multiple time zones (but I assume I know the time zone). Each event has a timestamp that was normalized into a common time (by instantiating each log parser's Calendar instance with the time zone appropriate to that log file and then using getTimeInMillis to get the UTC time). The logs are already sorted by timestamp. Multiple events can occur at the same time, but they are by no means equal events.
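
For reference, a minimal sketch of that normalization using Java's Calendar (the zone ID and field values would come from each log's own parser; nothing here is taken from the actual logs):

    import java.util.Calendar;
    import java.util.TimeZone;

    class TimestampNormalizer {
        // Build the Calendar in the log file's own time zone; getTimeInMillis()
        // then yields UTC epoch milliseconds regardless of that zone.
        static long toUtcMillis(int year, int month0, int day,
                                int hour, int minute, int second, String zoneId) {
            Calendar cal = Calendar.getInstance(TimeZone.getTimeZone(zoneId));
            cal.clear();                                      // drop "current time" defaults
            cal.set(year, month0, day, hour, minute, second); // month0 is 0-based (Calendar.JANUARY == 0)
            return cal.getTimeInMillis();                     // comparable across all logs
        }
    }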

These files can be relatively large, as in, 500000 events or more in a single log, so reading the entire contents of the logs into a simple Event[] is not feasible.

What I am trying to do is merge the events from each of the logs into a single log. It is kinda like a mergesort task, except each log is already sorted; I just need to bring them together. The second component is that the same event can be witnessed in each of the separate log files, and I want to "remove duplicate events" in the output log file.

Can this be done "in place", as in, sequentially working over some small buffers of each log file? I can't simply read in all the files into an Event[], sort the list, and then remove duplicates, but so far my limited programming capabilities only enable me to see this as the solution. Is there some more sophisticated approach that I can use to do this as I read events from each of the logs concurrently?

+8  A: 
  1. Read the first line from each of the log files

  2. LOOP

    a. Find the "earliest" line.

    b. Insert the "earliest" line into the master log file

    c. Read the next line from the file that contained the earliest line

You could check for duplicates between steps b and c, advancing the pointer in each file that contains the duplicate line.
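
For anyone who wants the steps spelled out, here is a rough Java sketch of them (not from the answer itself: the Event stand-in, the "UTC millis first on each line" format it parses, and the use of BufferedReader are assumptions, and the duplicate check only catches copies that come out adjacent in the merged order):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.ArrayList;
    import java.util.List;

    public class LogMerger {

        // Minimal stand-in for the question's Event type; the real parsing
        // (time zone handling etc.) lives in the per-file parsers. Here we
        // just assume each line starts with its normalized UTC millis.
        static final class Event {
            final long timeMillis;
            final String line;

            Event(long timeMillis, String line) {
                this.timeMillis = timeMillis;
                this.line = line;
            }

            static Event parse(String line) {
                return new Event(Long.parseLong(line.substring(0, line.indexOf(' '))), line);
            }

            @Override public boolean equals(Object o) {
                return o instanceof Event && ((Event) o).line.equals(line);
            }
            @Override public int hashCode() { return line.hashCode(); }
            @Override public String toString() { return line; }
        }

        public static void merge(List<BufferedReader> logs, PrintWriter out) throws IOException {
            // Step 1: read the first line from each log file
            List<Event> current = new ArrayList<>();
            for (BufferedReader log : logs) {
                String line = log.readLine();
                current.add(line == null ? null : Event.parse(line));
            }

            Event lastWritten = null;
            while (true) {
                // Step 2a: find the "earliest" of the current lines
                int earliest = -1;
                for (int i = 0; i < current.size(); i++) {
                    if (current.get(i) != null
                            && (earliest < 0
                                || current.get(i).timeMillis < current.get(earliest).timeMillis)) {
                        earliest = i;
                    }
                }
                if (earliest < 0) {
                    break; // every log is exhausted
                }

                // Step 2b: write it to the master log, unless it duplicates
                // the event just written (the duplicate check mentioned above)
                Event e = current.get(earliest);
                if (lastWritten == null || !e.equals(lastWritten)) {
                    out.println(e);
                    lastWritten = e;
                }

                // Step 2c: read the next line from the file that supplied it
                String next = logs.get(earliest).readLine();
                current.set(earliest, next == null ? null : Event.parse(next));
            }
        }
    }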

Adam Tegen
I think I understand what you are saying. I totally skipped over the concept of what you suggest in step C and kept going down the wrong road. This will be sweet if it works. Thanks!
Josh
+3  A: 

Sure - open every log file. Read in the first line for each into an array of 'current' lines. Then repeatedly pick the line with the lowest timestamp from the current array. Write it to the output, and read a new line from the appropriate source file to replace it.

Here's an example in Python, but it makes good pseudocode, too:

def merge_files(files, key_func):
    # Populate the current array with the first line from each file
    current = [file.readline() for file in files]
    while len(current) > 0:
        # Find the index of the line with the lowest key according to key_func
        min_idx = min(range(len(files)), key=lambda x: key_func(current[x]))
        yield current[min_idx]
        new_line = files[min_idx].readline()
        if not new_line:
            # EOF, remove this file from consideration
            del current[min_idx]
            del files[min_idx]
        else:
            current[min_idx] = new_line
Nick Johnson
Thanks so much! I ended up writing something very similar in Java, and combined with the help from the first answer, I think I got it.
Josh
A: 

Read only one line at a time from each source file. Compare the lines and write the one with the earlier timestamp to the output file (and advance that file to its next line). Do this until you have reached the end of both files, and you've merged the files.

And make sure to remove duplicates :)

I guess this code in C# may illustrate the approach:

        // The file paths here are just placeholders
        StreamReader fileStream1 = new StreamReader("log1.txt");
        StreamReader fileStream2 = new StreamReader("log2.txt");
        Event eventCursorFile1 = Event.Parse(fileStream1.ReadLine());
        Event eventCursorFile2 = Event.Parse(fileStream2.ReadLine());

        while (!(fileStream1.EndOfStream && fileStream2.EndOfStream))
        {
            if (eventCursorFile1.TimeStamp < eventCursorFile2.TimeStamp)
            {
                // File 1 has the earlier event
                WriteToMasterFile(eventCursorFile1);
                eventCursorFile1 = Event.Parse(fileStream1.ReadLine());
            }
            else if (eventCursorFile1.TimeStamp == eventCursorFile2.TimeStamp)
            {
                // Same timestamp: write it once and advance both cursors (duplicate removal)
                WriteToMasterFile(eventCursorFile1);
                eventCursorFile1 = Event.Parse(fileStream1.ReadLine());
                eventCursorFile2 = Event.Parse(fileStream2.ReadLine());
            }
            else
            {
                // File 2 has the earlier event
                WriteToMasterFile(eventCursorFile2);
                eventCursorFile2 = Event.Parse(fileStream2.ReadLine());
            }
        }

The break condition isn't exactly right since this is just quick and dirty, but it should look similar.

Tigraine
Thanks, but I am working with a variable number of files (e.g. from 1 to X), not just two, so it is not exactly what I am looking for but it does help.
Josh
A: 

Or you could borrow a log merge utility from AWStats, which is an open-source website stats tool.

logresolvemerge.pl is a Perl script that can merge multiple log files; it can even use multiple threads to merge them (you need Perl 5.8 for multi-threaded use). Why don't you try using a readily available tool instead of building one?

anjanb
The project is in Java, and I don't really want to make system calls or require Perl. Also, I don't write information to a log file; I process the results of the merge periodically as they are output from the merger. And it's a hobby project where I want to learn how to do it myself.
Josh
+1  A: 

Check out this link: http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194

  • Use a min-heap (it can be backed by an array), keyed on the event timestamp. The number of elements in this heap/array will be equal to the number of log files you have.

  • Read the first records from all the files and insert them into your heap.

  • Loop until (there are no more records in any of the files)

      > remove the minimum element from the heap
      > write it to the output
      > read the next record from the file to which the (previous) minimum element belonged
          if there are no more records in that file
              remove that file from the file list
              continue
      > if it is the same as the element just written (a duplicate), discard it and read again; otherwise add it to the heap

Now you have all your events in one log file, they are sorted, and there are no duplicates. The time complexity of the algorithm is O(n log k), where n is the total number of records and k is the number of log files. (A rough Java sketch of this approach is shown below.)

You should use buffered reader and buffered writer objects when reading from and writing to the files, to minimize the number of disk reads and writes and keep the running time down.
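
A rough Java sketch of this heap approach (again, not part of the original answer): java.util.PriorityQueue serves as the min-heap, a caller-supplied timestamp extractor stands in for the real event parser, and the duplicate check is done at write time, so it only catches copies that come out adjacent in the merged order:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;
    import java.util.function.ToLongFunction;

    public class HeapLogMerger {

        // Pairs a raw log line with the reader it came from, so we know which
        // file to pull the next record from after popping it.
        static final class Entry {
            final String line;
            final BufferedReader source;
            Entry(String line, BufferedReader source) {
                this.line = line;
                this.source = source;
            }
        }

        // Merges already-sorted readers into `out`, dropping consecutive
        // duplicate lines. `timestampOf` extracts the normalized UTC millis
        // from a line (how it parses that is up to the caller).
        public static void merge(List<BufferedReader> logs, PrintWriter out,
                                 ToLongFunction<String> timestampOf) throws IOException {
            // Min-heap keyed on the timestamp; it holds at most one pending
            // record per log file, so k elements for k files: O(n log k) overall.
            PriorityQueue<Entry> heap = new PriorityQueue<>(
                    Comparator.comparingLong((Entry e) -> timestampOf.applyAsLong(e.line)));

            // Seed the heap with the first record from every file.
            for (BufferedReader log : logs) {
                String line = log.readLine();
                if (line != null) {
                    heap.add(new Entry(line, log));
                }
            }

            String lastWritten = null;
            while (!heap.isEmpty()) {
                Entry min = heap.poll(); // earliest remaining record
                if (!min.line.equals(lastWritten)) {
                    out.println(min.line); // skip duplicates at write time
                    lastWritten = min.line;
                }
                // Refill from the file the popped record came from; an
                // exhausted file simply drops out of consideration.
                String next = min.source.readLine();
                if (next != null) {
                    heap.add(new Entry(next, min.source));
                }
            }
        }
    }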

Rajesh Konda