I don't care what the differences are. I just want to know whether the contents are different.
views:
807answers:
8In Python, is there a concise way of comparing whether the contents of two text files are the same?
The low level way:
from __future__ import with_statement
with open(filename1) as f1:
with open(filename2) as f2:
if f1.read() == f2.read():
...
The high level way:
import filecmp
if filecmp.cmp(filename1, filename2, shallow=False):
...
if open('filename1','r').read() == open('filename2','r').read():
// files are the same
f = open(filename1, "r").read()
f2 = open(filename2,"r").read()
print f == f2
If you're going for even basic efficiency, you probably want to check the file size first:
if os.path.getsize(filename1) == os.path.getsize(filename2):
if open('filename1','r').read() == open('filename2','r').read():
# Files are the same.
This saves you reading every line of two files that aren't even the same size, and thus can't be the same.
(Even further than that, you could call out to a fast MD5sum of each file and compare those, but that's not "in Python", so I'll stop here.)
I would use a hash of the file's contents using MD5.
import hashlib
def checksum(f):
md5 = hashlib.md5()
md5.update(open(f).read())
return md5.hexdigest()
def is_contents_same(f1, f2):
return checksum(f1) == checksum(f2)
if not is_contents_same('foo.txt', 'bar.txt'):
print 'The contents are not the same!'
Since I can't comment on the answers of others I'll write my own.
If you use md5 you definitely must not just md5.update(f.read()) since you'll use too much memory.
def get_file_md5(f, chunk_size=8192):
h = hashlib.md5()
while True:
chunk = f.read(chunk_size)
if not chunk:
break
h.update(chunk)
return h.hexdigest()
This is a functional-style file comparison function. It returns instantly False if the files have different sizes; otherwise, it reads in 4KiB block sizes and returns False instantly upon the first difference:
from __future__ import with_statement
import os
import itertools, functools, operator
def filecmp(filename1, filename2):
"Do the two files have exactly the same contents?"
with open(filename1, "rb") as fp1:
with open(filename2, "rb") as fp2:
if os.fstat(fp1.fileno()).st_size != os.fstat(fp2.fileno()).st_size:
return False # different sizes ∴ not equal
fp1_reader= functools.partial(fp1.read, 4096)
fp2_reader= functools.partial(fp2.read, 4096)
cmp_pairs= itertools.izip(iter(fp1_reader, ''), iter(fp2_reader, ''))
inequalities= itertools.starmap(operator.ne, cmp_pairs)
return not any(inequalities)
if __name__ == "__main__":
import sys
print filecmp(sys.argv[1], sys.argv[2])
Just a different take :)