tags:
views: 860
answers: 2

Possible Duplicates:
Finding duplicate files and removing them.
In Python, is there a concise way of comparing whether the contents of two text files are the same?

What is the easiest way to see if two files are the same in Python?

One thing I can do is md5 each file and compare the hashes. Is there a better way?

+12  A: 

Yes, I think hashing the file would be the best way if you have to compare several files and store hashes for later comparison. Since hashes can collide, a byte-by-byte comparison may still be needed, depending on the use case.
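
For example, a minimal sketch of that idea (a hypothetical file_md5 helper; it hashes the whole file and compares the digests, which can also be stored for later):

import hashlib

def file_md5(path):
    # hash the whole file contents; the digest can be stored for later comparison
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

# equal digests strongly suggest identical files, but since hashes can
# collide, a byte-by-byte check can follow if it matters for the use case
same = file_md5('file1.txt') == file_md5('file2.txt')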

Generally, a byte-by-byte comparison is sufficient and efficient, and the filecmp module already does this (plus other things).

See http://docs.python.org/library/filecmp.html. For example:

>>> import filecmp
>>> filecmp.cmp('file1.txt', 'file1.txt')
True
>>> filecmp.cmp('file1.txt', 'file2.txt')
False

Speed consideration: if only two files have to be compared, hashing them and comparing the digests will usually be slower than a simple byte-by-byte comparison done efficiently. The code below tries to time the hash vs. byte-by-byte approaches.

Disclaimer: this is not the best way of timing or comparing two algorithms, and it could be improved, but it gives a rough idea. If you think it should be improved, tell me and I will change it.

import random
import string
import hashlib
import time

def getRandText(N):
    # build a random printable string of length N
    return "".join(random.choice(string.printable) for _ in range(N))

N = 1000000
randText1 = getRandText(N)
randText2 = getRandText(N)

def cmpHash(text1, text2):
    # hash both inputs and compare the digests
    hash1 = hashlib.md5(text1.encode()).hexdigest()
    hash2 = hashlib.md5(text2.encode()).hexdigest()
    return hash1 == hash2

def cmpByteByByte(text1, text2):
    # direct comparison; == stops at the first differing position
    return text1 == text2

for cmpFunc in (cmpHash, cmpByteByByte):
    st = time.time()
    for i in range(10):
        cmpFunc(randText1, randText2)
    print(cmpFunc.__name__, time.time() - st)

and the output is

cmpHash 0.234999895096
cmpByteByByte 0.0
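
A somewhat more robust way to time the same two functions (just a sketch, reusing the cmpHash/cmpByteByByte and randText1/randText2 defined above) is the standard timeit module:

import timeit

# let timeit drive the loop; it avoids some pitfalls of manual time.time() deltas
for cmpFunc in (cmpHash, cmpByteByByte):
    t = timeit.timeit(lambda: cmpFunc(randText1, randText2), number=10)
    print(cmpFunc.__name__, t)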
Anurag Uniyal
No reason to do an expensive hash when a simple byte-by-byte comparison will work. +1 for filecmp
John Kugelman
If you have many huge files there's no reason to do an expensive byte-by-byte comparison when a simple hash calculation will work.
Vinko Vrsalovic
Yes, agreed; unless we have to compare N files with each other. Can filecmp work there, or be faster than hashing?
Anurag Uniyal
@Vinko, in principle hashing should be slower than a byte-by-byte cmp, but since a pure byte-by-byte cmp would run in a Python for loop I think it would end up slower, as is the case with the filecmp implementation.
Anurag Uniyal
@Anurag, I'd like to see some proof of that statement. My understanding is the exact opposite.
Vinko Vrsalovic
@Vinko, I have modified the answer to include timings of the two approaches.
Anurag Uniyal
Well, for a realistic test, one where the benefits of hashing for this purpose show, you should compare a single (same) 'file' to many different files, not just single pairs. In case I wasn't clear before: of course I agree that when you compare each file to only one other file, a byte-by-byte comparison will be faster (after all, you have to read the whole file and do calculations to get a hash). Things start to change when you want to compare one file to many other files, where the cost of calculating the hashes is amortized over the number of comparisons.
Vinko Vrsalovic
Yes, I agree, and if you read my answer, the first line is "hashing the file would be the best way if you have to compare several files and store hashes for later comparison"; my first comment above also says so.
Anurag Uniyal
Don't forget you can have hash collisions! If the hashes compare OK you must proceed by comparing the file contents.
nosklo
Yes, I will add that to the answer; so in the case where there are going to be many similar files, it is better to do a byte-by-byte cmp.
Anurag Uniyal
A: 

I'm not sure if you want to find duplicate files or just compare two specific files. If the latter, the above approach (filecmp) is better; if the former, the following approach is better.

There are lots of duplicate-file detection questions here. Assuming the files are not very small and that performance is important, you can:

  • Compare file sizes first, discarding all files whose sizes don't match
  • If the sizes match, compare using the biggest hash you can handle, hashing chunks of the files to avoid reading whole big files into memory (a rough sketch is below)

Here's an answer with Python implementations (I prefer the one by nosklo, BTW).
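
For illustration, a minimal sketch of the two steps above (hypothetical chunked_md5 and find_candidate_duplicates helpers; MD5 over fixed-size chunks, with the usual caveat that matching hashes should still be confirmed byte-by-byte if collisions matter):

import hashlib
import os
from collections import defaultdict

CHUNK_SIZE = 64 * 1024  # hash files in 64 KiB chunks

def chunked_md5(path):
    # hash a file without loading it into memory all at once
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b''):
            h.update(chunk)
    return h.hexdigest()

def find_candidate_duplicates(paths):
    # step 1: group by size; a file with a unique size cannot have a duplicate
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    # step 2: within each size group, group by chunked hash
    by_hash = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        for p in group:
            by_hash[(size, chunked_md5(p))].append(p)
    return [g for g in by_hash.values() if len(g) > 1]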

Vinko Vrsalovic