Hello,

I have some simple data stored in a series of text files. One line per record, but the number and type of fields can vary per record.

The files contain almost the same data.

There exists an "ideal" data file to which these must be compared. Some fields can vary, but some need to match. I also need to know if any records are missing / added compared to the master.

What would be a good approach to take?

Thank you

A: 

I modified the following to simply iterate over every combination of lines from file1 and file2. I think the for/else construct works well here.

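def comparefiles(file1, file2):
    # file1 and file2 are lists of lines, so file2 can be scanned repeatedly
    unmatched = []
    for row1 in file1:
        for row2 in file2:
            # check all of your fields here; break once the condition is met
            if row1 == row2:
                break
        else:
            # the inner loop finished without a break: no match for row1
            unmatched.append(row1)
    return unmatched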
mvrak
I know this is just a rough guideline but you may also want to check if *file2* contains additional rows that are not in *file1*.
Alexandre Jasmin
This would certainly work, but is there a way to gather statistics on "correctness" and not totally die if file2 is missing a record at the beginning? (Also keeping in mind I'd be looping through 20+ "file2"s.)
nonot1
The break doesn't totally die; it only exits the second for loop. Basically, when you find the match for the line in file1, the break says: OK, I found my match, I no longer have to check file2.
mvrak
You would invoke this function for each file2 that you are iterating through. What you return from this function is dependent on your use, which you didn't detail here.
mvrak
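For the "statistics over 20+ files" case discussed above, one possible sketch is to compare sets of record keys instead of nesting loops. This is only an illustration: `compare_to_master` and its `key` argument are hypothetical names, and it assumes the first whitespace-separated field identifies a record (duplicate keys are collapsed).

```python
def compare_to_master(master_rows, other_rows, key=lambda row: row.split()[0]):
    """Return matched count plus the keys missing from / added to other_rows.

    key is a hypothetical helper that extracts the fields that must match;
    by default it assumes the first whitespace-separated field is the key.
    """
    master_keys = {key(row) for row in master_rows}
    other_keys = {key(row) for row in other_rows}
    return {
        "matched": len(master_keys & other_keys),
        "missing": sorted(master_keys - other_keys),  # in master only
        "added": sorted(other_keys - master_keys),    # in the other file only
    }


# Made-up sample data standing in for the master file and two "file2"s.
master = ["a 1.0", "b 2.0", "c 3.0"]
candidates = {
    "file2": ["b 2.0", "c 3.1"],                    # first record missing
    "file3": ["a 1.0", "b 2.0", "c 3.0", "d 4.0"],  # one extra record
}

# One report per candidate file; a missing record never stops the run.
report = {name: compare_to_master(master, rows) for name, rows in candidates.items()}
```

Each entry in `report` can then be summed or printed per file. Field-by-field checks on the matched records would still need a second step.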
A: 

I completely agree with unutbu, you should use difflib for that.

difflib.SequenceMatcher(None, file1.read(), file2.read())
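
Note that passing the result of read() compares the two files character by character. To compare record by record, a list of lines can be passed instead (for example file1.read().splitlines()). A minimal sketch, where `diff_records` is a hypothetical helper and the sample lists are made up:

```python
import difflib


def diff_records(master_lines, other_lines):
    """Return (tag, master_slice, other_slice) for every non-equal block."""
    matcher = difflib.SequenceMatcher(
        None, master_lines, other_lines, autojunk=False
    )
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # tag is one of 'equal', 'replace', 'delete', 'insert'
        if tag != "equal":
            changes.append((tag, master_lines[i1:i2], other_lines[j1:j2]))
    return changes


master = ["a 1.0", "b 2.0", "c 3.0"]
other = ["b 2.0", "c 3.0", "d 4.0"]
changes = diff_records(master, other)
# changes == [("delete", ["a 1.0"], []), ("insert", [], ["d 4.0"])]
```

A 'delete' block is a record missing from the other file, and an 'insert' block is a record that was added to it.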
pyfunc
Looking at the difflib docs, but it seems built for text. In my case I have non-"string" text data too. 1.9 is "close enough" to 2.0, for instance.
nonot1
@user489549: you could always do a second pass to sanitize the diff lines. That way the bulk of the logic for comparing lines is left to difflib, and your custom logic is applied only to the changed lines. The sequence matcher can also realign properly when multiple lines have been inserted.
pyfunc
@user489549: In fact, it won't die if some lines were missing. It will show up in the diff.
pyfunc
Can difflib work on custom objects?
nonot1
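On the last question: SequenceMatcher accepts any sequences whose elements are hashable, so it is not limited to strings. One way to combine that with a tolerant second pass is sketched below. It is only an illustration under assumptions: the helper names (`parse`, `close_enough`, `tolerant_diff`) are hypothetical, every line is assumed non-empty with its first field acting as the record key, and the tolerance of 0.15 is arbitrary.

```python
import difflib
import math


def parse(line):
    """Split a line into (key, values); numeric fields become floats."""
    name, *fields = line.split()
    values = []
    for field in fields:
        try:
            values.append(float(field))
        except ValueError:
            values.append(field)  # keep non-numeric fields as strings
    return name, tuple(values)


def close_enough(a, b, tol):
    """Tolerant comparison for two floats, exact comparison otherwise."""
    if isinstance(a, float) and isinstance(b, float):
        return math.isclose(a, b, abs_tol=tol)
    return a == b


def tolerant_diff(master_lines, other_lines, tol=0.15):
    """Align records by key with difflib, then compare fields tolerantly."""
    master = [parse(line) for line in master_lines]
    other = [parse(line) for line in other_lines]
    matcher = difflib.SequenceMatcher(
        None, [k for k, _ in master], [k for k, _ in other], autojunk=False
    )
    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            continue  # missing or added records are reported by the opcodes
        for (key, mv), (_, ov) in zip(master[i1:i2], other[j1:j2]):
            same = len(mv) == len(ov) and all(
                close_enough(a, b, tol) for a, b in zip(mv, ov)
            )
            if not same:
                mismatches.append(key)
    return mismatches


mismatches = tolerant_diff(["a 2.0 x", "b 5.0 y"], ["a 1.9 x", "b 7.0 y"])
# mismatches == ["b"]: 1.9 is within 0.15 of 2.0, but 7.0 is not close to 5.0
```

Aligning on keys only keeps the tolerant comparison out of the hashing step, which matters because "close enough" is not a transitive notion of equality.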