views:

116

answers:

2

I have an interesting problem.

I have a very large (larger than 300MB, more than 10,000,000 lines/rows in the file) CSV file with time series data points inside. Every month I get a new CSV file that is almost the same as the previous file, except for a few new lines have been added and/or removed and perhaps a couple of lines have been modified.

I want to use Python to compare the 2 files and identify which lines have been added, removed and modified.

The issue is that the file is very large, so I need a solution that can handle the large file size and execute efficiently within a reasonable time, the faster the better.

Example of what a file and its new file might look like:

Old file
A,2008-01-01,23
A,2008-02-01,45
B,2008-01-01,56
B,2008-02-01,60
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,9
etc...

New file
A,2008-01-01,23
A,2008-02-01,45
A,2008-03-01,67 (added)
B,2008-01-01,56
B,2008-03-01,33 (removed and added)
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,22 (modified)
etc...

Basically the 2 files can be seen as matrices that need to be compared, and I have begun thinking of using PyTable. Any ideas on how to solve this problem would be greatly appreciated.

+3  A: 

Like this.

Step 1. Sort.

Step 2. Read each file, doing line-by-line comparison. Write differences to another file.

You can easily write this yourself. Or you can use difflib. http://docs.python.org/library/difflib.html

Note that the general solution is quite slow as it searches for matching lines near a difference. Writing your own solution can run faster because you know things about how the files are supposed to match. You can optimize that "resynch-after-a-diff" algorithm.

And 10,000,000 lines hardly matters. It's not that big. Two 300Mb files easily fit into memory.

S.Lott
+1  A: 

This is a little bit of a naive implementation but will deal with unsorted data:

import csv

file1_dict = {}
file2_dict = {}

with open('file1.csv') as handle:
    for row in csv.reader(handle):
        file1_dict[tuple(row[:2])] = row[2:]

with open('file2.csv') as handle:
    for row in csv.reader(handle):
        file2_dict[tuple(row[:2])] = row[2:]

with open('outfile.csv', 'w') as handle:
    writer = csv.writer(handle)
    for key, val in file1_dict.iteritems():
        if key in file2_dict:
            #deal with keys that are in both
            if file2_dict[key] == val:          
                writer.writerow(key+val+('Same',))
            else:
                writer.writerow(key+file2_dict[key]+('Modified',))
            file2_dict.pop(key)
        else:
            writer.writerow(key+val+('Removed',))
    #deal with added keys!  
    for key, val in file2_dict.iteritems():
        writer.writerow(key+val+('Added',))

You probably won't be able to "drop in" this solution but it should get you ~95% of the way there. @S.Lott is right, 2 300mb files will easily fit in memory ... if your files get into the 1-2gb range then this may have to be modified with the assumption of sorted data.

Something like this is close ... although you may have to change the comparisons around for the added a modified to make sense:

#assumming both files are sorted by columns 1 and 2
import datetime
from itertools import imap

def str2date(in):
    return datetime.date(*map(int,in.split('-')))

def convert_tups(row):
    key = (row[0], str2date(row[1]))
    val = tuple(row[2:])
    return key, val

with open('file1.csv') as handle1:
    with open('file2.csv') as handle2:
        with open('outfile.csv', 'w') as outhandle:
            writer = csv.writer(outhandle)
            gen1 = imap(convert_tups, csv.reader(handle1))
            gen2 = imap(convert_tups, csv.reader(handle2))
            gen2key, gen2val = gen2.next()      
            for gen1key, gen1val in gen1:
                if gen1key == gen2key and gen1val == gen2val:
                    writer.writerow(gen1key+gen1val+('Same',))
                    gen2key, gen2val = gen2.next()
                elif gen1key == gen2key and gen1val != gen2val:
                    writer.writerow(gen2key+gen2val+('Modified',))
                    gen2key, gen2val = gen2.next()
                elif gen1key > gen2key:
                    while gen1key>gen2key:
                        writer.writerow(gen2key+gen2val+('Added',))
                        gen2key, gen2val = gen2.next()
                else:
                    writer.writerow(gen1key+gen1val+('Removed',))
JudoWill