ansaurus

Question

How to Compare 2 very large matrices using Python

Answer 1

+3 A:

Like this.

Step 1. Sort.

Step 2. Read each file, doing line-by-line comparison. Write differences to another file.

You can easily write this yourself. Or you can use difflib. http://docs.python.org/library/difflib.html

Note that the general solution is quite slow as it searches for matching lines near a difference. Writing your own solution can run faster because you know things about how the files are supposed to match. You can optimize that "resynch-after-a-diff" algorithm.

And 10,000,000 lines hardly matters. It's not that big. Two 300Mb files easily fit into memory.

S.Lott 2010-09-21 13:25:23

Answer 2

+1 A:

This is a little bit of a naive implementation but will deal with unsorted data:

import csv

file1_dict = {}
file2_dict = {}

with open('file1.csv') as handle:
    for row in csv.reader(handle):
        file1_dict[tuple(row[:2])] = row[2:]

with open('file2.csv') as handle:
    for row in csv.reader(handle):
        file2_dict[tuple(row[:2])] = row[2:]

with open('outfile.csv', 'w') as handle:
    writer = csv.writer(handle)
    for key, val in file1_dict.iteritems():
        if key in file2_dict:
            #deal with keys that are in both
            if file2_dict[key] == val:          
                writer.writerow(key+val+('Same',))
            else:
                writer.writerow(key+file2_dict[key]+('Modified',))
            file2_dict.pop(key)
        else:
            writer.writerow(key+val+('Removed',))
    #deal with added keys!  
    for key, val in file2_dict.iteritems():
        writer.writerow(key+val+('Added',))

You probably won't be able to "drop in" this solution but it should get you ~95% of the way there. @S.Lott is right, 2 300mb files will easily fit in memory ... if your files get into the 1-2gb range then this may have to be modified with the assumption of sorted data.

Something like this is close ... although you may have to change the comparisons around for the added a modified to make sense:

#assumming both files are sorted by columns 1 and 2
import datetime
from itertools import imap

def str2date(in):
    return datetime.date(*map(int,in.split('-')))

def convert_tups(row):
    key = (row[0], str2date(row[1]))
    val = tuple(row[2:])
    return key, val

with open('file1.csv') as handle1:
    with open('file2.csv') as handle2:
        with open('outfile.csv', 'w') as outhandle:
            writer = csv.writer(outhandle)
            gen1 = imap(convert_tups, csv.reader(handle1))
            gen2 = imap(convert_tups, csv.reader(handle2))
            gen2key, gen2val = gen2.next()      
            for gen1key, gen1val in gen1:
                if gen1key == gen2key and gen1val == gen2val:
                    writer.writerow(gen1key+gen1val+('Same',))
                    gen2key, gen2val = gen2.next()
                elif gen1key == gen2key and gen1val != gen2val:
                    writer.writerow(gen2key+gen2val+('Modified',))
                    gen2key, gen2val = gen2.next()
                elif gen1key > gen2key:
                    while gen1key>gen2key:
                        writer.writerow(gen2key+gen2val+('Added',))
                        gen2key, gen2val = gen2.next()
                else:
                    writer.writerow(gen1key+gen1val+('Removed',))

JudoWill 2010-09-21 14:53:26

ansaurus

tags:

views:

answers:

How to Compare 2 very large matrices using Python

related questions