ansaurus

Question

Best way to check new-line-independent-identity of 2 files with python

Answer 1

+1 A:

Try the difflib module - it provides classes and functions for comparing sequences.

For your needs, the difflib.Differ class looks interesting.

class difflib.Differ

This is a class for comparing sequences of lines of text, and producing human-readable differences or deltas. Differ uses SequenceMatcher both to compare sequences of lines, and to compare sequences of characters within similar (near-matching) lines.

See the Differ example, which compares two texts. The sequences being compared can also be obtained from the readlines() method of file-like objects.
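A minimal sketch of how Differ could be applied to the newline question. Differ compares the lines as given, line endings included, so this sketch strips them first with str.splitlines(). The function names differ_delta and same_ignoring_newlines are made up for illustration, and note that splitlines() also splits on a few less common separators such as form feed.

```python
import difflib

def differ_delta(text1, text2):
    """Return the Differ delta of two texts, ignoring line-ending style."""
    # splitlines() drops "\n", "\r\n" and "\r", so only content is compared
    lines1 = text1.splitlines()
    lines2 = text2.splitlines()
    return list(difflib.Differ().compare(lines1, lines2))

def same_ignoring_newlines(text1, text2):
    """True if every delta line is marked as common to both texts."""
    # Differ prefixes unchanged lines with two spaces;
    # "- ", "+ " and "? " mark differences
    return all(line.startswith("  ") for line in differ_delta(text1, text2))
```

If you only need a yes/no answer this is more machinery than necessary, but the delta is handy when you also want to show where the files differ.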

gimel 2010-07-19 10:50:05

Answer 2

+2 A:

I think a simple convenience function like this should do the job:

from itertools import izip_longest  # Python 2

def areFilesIdentical(filename1, filename2):
    with open(filename1, "rtU") as a:
        with open(filename2, "rtU") as b:
            # Note that "all" and "izip_longest" are lazy
            # (will stop at the first line that's not identical).
            # izip_longest pads the shorter file with None, so files
            # with a different number of lines compare as different.
            return all(lineA == lineB
                       for lineA, lineB in izip_longest(a, b))

AndiDog 2010-07-19 10:55:43

I'm curious what the 't' modifier does.

compie 2010-07-19 11:25:16

Well actually it's ignored and you can just write "rU". The file object ensures that all newlines are represented as '\n' with the universal newline mode ("U").

AndiDog 2010-07-19 12:46:38
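For anyone on Python 3, here is a rough equivalent of the line-by-line comparison, as a sketch only. It assumes both files are text in a known encoding (the encoding parameter and its "utf-8" default are my assumption, not part of the answer above). Passing newline=None to open() turns on universal newlines, so "\r\n" and "\r" are read as "\n".

```python
from itertools import zip_longest

def are_files_identical(filename1, filename2, encoding="utf-8"):
    """Compare two text files line by line, ignoring line-ending style."""
    # newline=None translates "\r\n" and "\r" to "\n" while reading
    with open(filename1, "r", encoding=encoding, newline=None) as a, \
         open(filename2, "r", encoding=encoding, newline=None) as b:
        # zip_longest pads the shorter file with None, which never equals
        # a line, so a file that is a prefix of the other is not identical
        return all(line_a == line_b for line_a, line_b in zip_longest(a, b))
```

Because all() consumes a generator, the comparison still stops at the first differing line.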

Answer 3

A:

It looks like you just need to check whether the files are the same, ignoring whitespace/newlines.

You can use a function like this:

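def do_cmp(f1, f2):
    bufsize = 8*1024
    # "with" closes both files, even on an early return
    with open(f1, 'rb') as fp1, open(f2, 'rb') as fp2:
        while True:
            b1 = fp1.read(bufsize)
            b2 = fp2.read(bufsize)
            # Caveat: chunks are compared pairwise, so for files larger
            # than bufsize the stripped chunks get out of step when the
            # files contain different numbers of newline bytes
            if not is_same(b1, b2):
                return False
            if not b1:
                return True

def is_same(text1, text2):
    # The files are opened in binary mode, so use byte-string literals
    return text1.replace(b"\n", b"") == text2.replace(b"\n", b"")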

You can improve is_same so that it matches your requirements; for example, you could ignore case too.

Anurag Uniyal 2010-07-19 10:57:41

Byte strings don't have a `remove` method. Guess you mean something like `.replace("\r\n", "").replace("\n", "").replace("\n\r", "")`? And this won't work if the '\r' is at the end of one buffer and the '\n' is at the beginning of the next buffer.

AndiDog 2010-07-19 11:02:23

Yes, `replace`, thanks! The point is that inside `is_same` the OP can do whatever comparison he needs, e.g. `.replace("\n","").replace("\r","")`.

Anurag Uniyal 2010-07-19 13:08:40
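To make the buffer issue from the comments concrete, here is a sketch of a chunked comparison that keeps the two streams aligned. It removes every "\r" and "\n" byte, as suggested in the last comment, and carries leftover bytes over to the next round instead of comparing raw chunks pairwise. The names _stripped_chunks and same_ignoring_newline_bytes are hypothetical. Like is_same above, it treats "ab\ncd" and "abcd" as identical, which may or may not be what you want.

```python
def _stripped_chunks(path, bufsize=8 * 1024):
    """Yield non-empty chunks of the file with all CR and LF bytes removed."""
    with open(path, "rb") as fp:
        while True:
            chunk = fp.read(bufsize)
            if not chunk:
                return
            chunk = chunk.replace(b"\r", b"").replace(b"\n", b"")
            if chunk:
                yield chunk

def same_ignoring_newline_bytes(path1, path2, bufsize=8 * 1024):
    """Compare two files chunk by chunk, ignoring CR and LF bytes."""
    chunks1 = _stripped_chunks(path1, bufsize)
    chunks2 = _stripped_chunks(path2, bufsize)
    buf1 = buf2 = b""
    while True:
        # Refill an empty buffer; b"" means that file is exhausted
        if not buf1:
            buf1 = next(chunks1, b"")
        if not buf2:
            buf2 = next(chunks2, b"")
        if not buf1 or not buf2:
            # Equal only if both files ran out at the same time
            return not buf1 and not buf2
        # Compare the common prefix and keep the remainder for the next round
        n = min(len(buf1), len(buf2))
        if buf1[:n] != buf2[:n]:
            return False
        buf1, buf2 = buf1[n:], buf2[n:]
```

Since the newline bytes are dropped one at a time rather than as "\r\n" pairs, a '\r' at the end of one buffer and a '\n' at the start of the next cause no trouble.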
