ansaurus

Question

Answer 1

+8 A:

Read the first line from each of the log files
LOOP

a. Find the "earliest" line.

b. Insert the "earliest" line into the master log file

c. Read the next line from the file that contained the earliest line

You could check for duplicates between b and c, advancing the pointer for each of those files.

Adam Tegen 2008-09-24 16:09:10

I think I understand what you are saying. I totally skipped over the concept of what you suggest in step C and kept going down the wrong road. This will be sweet if it works. Thanks!

Josh 2008-09-24 16:26:13

Answer 2

+3 A:

Sure - open every log file. Read in the first line for each into an array of 'current' lines. Then repeatedly pick the line with the lowest timestamp from the current array. Write it to the output, and read a new line from the appropriate source file to replace it.

Here's an example in Python, but it makes good pseudocode, too:

def merge_files(files, key_func):
    # Populate the current array with the first line from each file
    current = [file.readline() for file in files]
    while len(current) > 0:
        # Find and return the row with the lowest key according to key_func
        min_idx = min(range(len(files)), key=lambda x: return key_func(current[x]))
        yield current[min_idx]
        new_line = files[min_idx].readline()
        if not new_line:
            # EOF, remove this file from consideration
            del current[min_idx]
            del files[min_idx]
        else:
            current[min_idx] = new_line

Nick Johnson 2008-09-24 16:09:14

Thanks so much! I ended up writing something very similar in Java and combined with the help from the first answer I think I got it.

Josh 2008-09-24 17:09:46

Answer 3

A:

Read only one line at a time from both source files. Compare the lines and write the older one to the output file (and advance to the next line). Do this until you have reached the end of both files and you've merged the files.

And make sure to remove duplicates :)

I guess this code in C# may illustrate the approach:

        StringReader fileStream1;
        StringReader fileStream2;
        Event eventCursorFile1 = Event.Parse(fileStream1.ReadLine());
        Event eventCursorFile2 = Event.Parse(fileStream2.ReadLine());

        while !(fileStream1.EOF && fileStream2.EOF)
        {
            if (eventCursorFile1.TimeStamp < eventCursorFile2.TimeStamp)
            {
                WriteToMasterFile(eventCursorFile1);
                eventCursorFile1 = Event.Parse(fileStream1.ReadLine());
            }
            else if (eventCursorFile1.TimeStamp == eventCursorFile2.TimeStamp)
            {
                WriteToMasterFile(eventCursorFile1);
                eventCursorFile1 = Event.Parse(fileStream1.ReadLine());
                eventCursorFile2 = Event.Parse(fileStream2.ReadLine());
            }
            else
            {
                WriteToMasterFile(eventCursorFile1);
                eventCursorFile2 = Event.Parse(fileStream2.ReadLine());
            }  
        }

The break condition isn't exactly right as this is just Quick'n'dirty, but it should look similar..

Tigraine 2008-09-24 16:15:25

Thanks, but I am working with a variable number of files (e.g. from 1 to X), not just two, so it is not exactly what I am looking for but it does help.

Josh 2008-09-24 16:45:48

Answer 4

A:

OR you could borrow a log merge utility from Awstats which is an open source website stats tool.

logresolvemerge.pl is a perl script that can merge multiple log files : you can even use multiple threads to merge the log files(need to have perl 5.8 for multi-thread use). Why don't you try using a readily available tool instead of building one ?

anjanb 2008-09-24 19:00:02

The project is in Java, and I don't really want to make system calls or require perl. That, and I don't write information to a log file, I process the results of the merge as they are output from the merger periodically. That and it's a hobby project where I want to learn how to do it myself.

Josh 2008-09-24 19:23:40

Answer 5

+1 A:

Checkout this link: http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194

Use a heap (based on an array). The number of elements in this heap/array will be equal to the number of log files you have.
Read the first records from all the files and insert them into your heap.
Loop until (no more records in any of the files)

      > remove the max element from the heap
      > write it to the output
      > read the next record from the file to which the (previous) max element belonged
          if there are no more records in that file
              remove it from file list
              continue
      > if it's not the same as the (previous) max element, add it to the heap

Now you have all your events in one log file, they are sorted, and there are no duplicates. The time complexity of the algorithm is (n log k) where n is the total number of records and k is the number of log files.

You should use buffered reader and buffered writer objects when reading to and from files to minimize the number of disk reads and writes, in order to optimize for time.

Rajesh Konda 2009-06-29 00:51:36

ansaurus

tags:

views:

answers:

Algorithm for merging large files

related questions