I don't care what the differences are. I just want to know whether the contents are different.

+15  A: 

The low level way:

from __future__ import with_statement
with open(filename1) as f1:
    with open(filename2) as f2:
        if f1.read() == f2.read():
            ...

The high level way:

import filecmp
if filecmp.cmp(filename1, filename2, shallow=False):
   ...
Federico Ramponi
I corrected your filecmp.cmp call, because without a non-true shallow argument it compares only os.stat() signatures (size, modification time), not the actual contents, and so doesn't do what the question asks for.
ΤΖΩΤΖΙΟΥ
You're right. http://www.python.org/doc/2.5.2/lib/module-filecmp.html . Thank you very much.
Federico Ramponi
+5  A: 
if open('filename1','r').read() == open('filename2','r').read():
    # files are the same
Adam Rosenfield
First, it would be better to open the files with 'rb'. Second, does this work (not "work correctly", just "work") for all file sizes?
ΤΖΩΤΖΙΟΥ
+1  A: 

f = open(filename1, "r").read()
f2 = open(filename2,"r").read()
print f == f2


mmattax
“Well, I have this 8 GiB file and that 32 GiB file that I want to compare…”
ΤΖΩΤΖΙΟΥ
+10  A: 

If you're going for even basic efficiency, you probably want to check the file size first:

import os

if os.path.getsize(filename1) == os.path.getsize(filename2):
    if open(filename1, 'r').read() == open(filename2, 'r').read():
        # Files are the same.

This saves you reading the contents of two files that aren't even the same size, and thus can't be the same.

(Even further than that, you could call out to a fast MD5sum of each file and compare those, but that's not "in Python", so I'll stop here.)
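
For the curious, a minimal sketch of that shell-out, assuming a POSIX md5sum binary is available on the PATH (so not portable to Windows; external_md5 is just an illustrative name):

import subprocess

def external_md5(filename):
    # md5sum prints "<hexdigest>  <filename>"; keep only the digest.
    out = subprocess.Popen(['md5sum', filename],
                           stdout=subprocess.PIPE).communicate()[0]
    return out.split()[0]

if external_md5(filename1) == external_md5(filename2):
    ...  # contents match (barring an MD5 collision)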

Rich
That's an excellent observation Rich.
Onorio Catenacci
The md5sum approach will be slower with just 2 files (you still need to read the files to compute the sums); it only pays off when you're looking for duplicates among several files.
Brian
@Brian: you're assuming that md5sum's file reading is no faster than Python's, and that there's no overhead from reading the entire file into the Python environment as a string! Try this with 2GB files...
Rich
There's no reason to expect md5sum's file reading to be faster than Python's - I/O is pretty independent of language. The large-file problem is a reason to iterate in chunks (or use filecmp), not to use md5, where you needlessly pay an extra CPU penalty.
Brian
This is especially true when you consider the case when the files are not identical. Comparing by blocks can bail out early, but md5sum must carry on reading the entire file.
Brian
+2  A: 

For larger files you could compute a MD5 or SHA hash of the files.
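
A minimal sketch of that idea, hashing in chunks so the whole file never has to sit in memory (hashlib ships with Python 2.5+; file_hash is an illustrative name):

from __future__ import with_statement
import hashlib
from functools import partial

def file_hash(filename, blocksize=65536):
    h = hashlib.sha1()
    with open(filename, 'rb') as fp:
        # Feed the hash one block at a time until EOF.
        for block in iter(partial(fp.read, blocksize), ''):
            h.update(block)
    return h.hexdigest()

if file_hash(filename1) == file_hash(filename2):
    ...  # contents match (barring a hash collision)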

ConcernedOfTunbridgeWells
So what about two 32 GiB files that differ only in the first byte? Why spend CPU time and wait that long for an answer?
ΤΖΩΤΖΙΟΥ
A: 

I would use an MD5 hash of the file's contents.

import hashlib

def checksum(f):
    md5 = hashlib.md5()
    md5.update(open(f, 'rb').read())
    return md5.hexdigest()

def is_contents_same(f1, f2):
    return checksum(f1) == checksum(f2)

if not is_contents_same('foo.txt', 'bar.txt'):
    print 'The contents are not the same!'
Jeremy Cantrell
+3  A: 

Since I can't comment on the answers of others, I'll write my own.

If you use md5, you definitely must not just md5.update(f.read()), since on large files you'll use far too much memory.

import hashlib

def get_file_md5(f, chunk_size=8192):
    h = hashlib.md5()
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
    return h.hexdigest()
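
For example, it could be called with files opened in binary mode like this (under Python 2.5 this also needs from __future__ import with_statement):

with open(filename1, 'rb') as f1:
    with open(filename2, 'rb') as f2:
        if get_file_md5(f1) == get_file_md5(f2):
            ...  # same contents
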
I believe that any hashing operation is overkill for this question's purposes; direct piece-by-piece comparison is faster and more straightforward.
ΤΖΩΤΖΙΟΥ
I was just clearing up the actual hashing part someone suggested.
+1 I like your version better. Also, I don't think using a hash is overkill. There's really no good reason not to if all you want to know is whether or not they're different.
Jeremy Cantrell
@Jeremy Cantrell: one computes hashes when they are to be cached/stored, or compared to cached/stored ones. Otherwise, just compare strings. Whatever the hardware, str1 != str2 is faster than md5.new(str1).digest() != md5.new(str2).digest(). Hashes also have collisions (unlikely but not impossible).
ΤΖΩΤΖΙΟΥ
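
To make the cached/stored case concrete, here is a rough sketch of duplicate detection among many files, reusing get_file_md5 from above (find_duplicates is an illustrative name): files are first grouped by size, and only same-size groups get hashed, each file exactly once.

import os
from collections import defaultdict

def find_duplicates(filenames):
    # Only files of equal size can possibly be identical.
    by_size = defaultdict(list)
    for name in filenames:
        by_size[os.path.getsize(name)].append(name)
    # Hash each candidate once; the stored digest stands in for
    # every pairwise comparison within its group.
    by_digest = defaultdict(list)
    for group in by_size.itervalues():
        if len(group) > 1:
            for name in group:
                fp = open(name, 'rb')
                try:
                    by_digest[get_file_md5(fp)].append(name)
                finally:
                    fp.close()
    return [names for names in by_digest.itervalues() if len(names) > 1]
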
+2  A: 

This is a functional-style file comparison function. It instantly returns False if the files have different sizes; otherwise, it reads in 4 KiB blocks and returns False as soon as the first difference is found:

from __future__ import with_statement
import os
import itertools, functools, operator

def filecmp(filename1, filename2):
    "Do the two files have exactly the same contents?"
    with open(filename1, "rb") as fp1:
        with open(filename2, "rb") as fp2:
            if os.fstat(fp1.fileno()).st_size != os.fstat(fp2.fileno()).st_size:
                return False # different sizes ∴ not equal
            fp1_reader = functools.partial(fp1.read, 4096)
            fp2_reader = functools.partial(fp2.read, 4096)
            cmp_pairs = itertools.izip(iter(fp1_reader, ''), iter(fp2_reader, ''))
            inequalities = itertools.starmap(operator.ne, cmp_pairs)
            return not any(inequalities)

if __name__ == "__main__":
    import sys
    print filecmp(sys.argv[1], sys.argv[2])

Just a different take :)

ΤΖΩΤΖΙΟΥ