tags:
views: 860
answers: 2

Possible Duplicates:
Finding duplicate files and removing them.
In Python, is there a concise way of comparing whether the contents of two text files are the same?

What is the easiest way to see if two files are the same in Python?

One thing I can do is md5 each file and compare the hashes. Is there a better way?

+12  A: 

Yes, I think hashing the file would be the best way if you have to compare several files and store hashes for later comparison. Since hashes can collide, a byte-by-byte comparison may still be needed, depending on the use case.
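
For example, a minimal sketch of that idea (a hypothetical file_md5 helper; it hashes the whole file and compares the digests, which can also be stored for later):

import hashlib

def file_md5(path):
    # hash the whole file contents; the digest can be stored for later comparison
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

# equal digests strongly suggest identical files, but since hashes can
# collide, a byte-by-byte check can follow if it matters for the use case
same = file_md5('file1.txt') == file_md5('file2.txt')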

Generally, a byte-by-byte comparison is sufficient and efficient, and the filecmp module already does this (plus other things).

See http://docs.python.org/library/filecmp.html. For example:

>>> import filecmp
>>> filecmp.cmp('file1.txt', 'file1.txt')
True
>>> filecmp.cmp('file1.txt', 'file2.txt')
False

Speed consideration: if only two files have to be compared, hashing them and comparing the digests will usually be slower than a simple byte-by-byte comparison done efficiently. The code below tries to time the hash vs. byte-by-byte approaches.

Disclaimer: this is not the best way of timing or comparing two algorithms, and it could be improved, but it gives a rough idea. If you think it should be improved, tell me and I will change it.

import random
import string
import hashlib
import time

def getRandText(N):
    # build a random printable string of length N
    return "".join(random.choice(string.printable) for _ in range(N))

N = 1000000
randText1 = getRandText(N)
randText2 = getRandText(N)

def cmpHash(text1, text2):
    # hash both inputs and compare the digests
    hash1 = hashlib.md5(text1.encode()).hexdigest()
    hash2 = hashlib.md5(text2.encode()).hexdigest()
    return hash1 == hash2

def cmpByteByByte(text1, text2):
    # direct comparison; == stops at the first differing position
    return text1 == text2

for cmpFunc in (cmpHash, cmpByteByByte):
    st = time.time()
    for i in range(10):
        cmpFunc(randText1, randText2)
    print(cmpFunc.__name__, time.time() - st)

and the output is

cmpHash 0.234999895096
cmpByteByByte 0.0
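
A somewhat more robust way to time the same two functions (just a sketch, reusing the cmpHash/cmpByteByByte and randText1/randText2 defined above) is the standard timeit module:

import timeit

# let timeit drive the loop; it avoids some pitfalls of manual time.time() deltas
for cmpFunc in (cmpHash, cmpByteByByte):
    t = timeit.timeit(lambda: cmpFunc(randText1, randText2), number=10)
    print(cmpFunc.__name__, t)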
Anurag Uniyal
No reason to do an expensive hash when a simple byte-by-byte comparison will work. +1 for filecmp
John Kugelman
If you have many huge files there's no reason to do an expensive byte-by-byte comparison when a simple hash calculation will work.
Vinko Vrsalovic
Yes, agreed; unless we have to compare N files with each other. Can filecmp work there, or be faster than hashing?
Anurag Uniyal
@Vinko, in principle hashing should be slower than a byte-by-byte cmp, but since a pure byte-by-byte cmp would run in a Python for loop I think it would end up slower, as is the case with the filecmp implementation.
Anurag Uniyal
@Anurag, I'd like to see some proof of that statement. My understanding is the exact opposite.
Vinko Vrsalovic
@Vinko, I have modified the answer to include timings of the two approaches.
Anurag Uniyal
Well, for a realistic test, one where the benefits of hashing for this purpose show, you should compare a single (same) 'file' to many different files, not just single pairs. In case I wasn't clear before: of course I agree that when you compare each file to only one other file, a byte-by-byte comparison will be faster (after all, you have to read the whole file and do calculations to get a hash). Things start to change when you want to compare one file to many other files, where the cost of calculating the hashes is amortized over the number of comparisons.
Vinko Vrsalovic
Yes, I agree, and if you read my answer, the first line is "hashing the file would be the best way if you have to compare several files and store hashes for later comparison"; my first comment above also says so.
Anurag Uniyal
Don't forget you can have hash collisions! If the hashes compare OK you must proceed by comparing the file contents.
nosklo
Yes, I will add that to the answer; so in the case where there are going to be many similar files, it is better to do a byte-by-byte cmp.
Anurag Uniyal
A: 

I'm not sure if you want to find duplicate files or just compare two specific files. If the latter, the above approach (filecmp) is better; if the former, the following approach is better.

There are lots of duplicate-file detection questions here. Assuming the files are not very small and that performance is important, you can:

  • Compare file sizes first, discarding all files whose sizes don't match
  • If the sizes match, compare using the biggest hash you can handle, hashing chunks of the files to avoid reading whole big files into memory (a rough sketch is below)

Here's an answer with Python implementations (I prefer the one by nosklo, BTW).
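
For illustration, a minimal sketch of the two steps above (hypothetical chunked_md5 and find_candidate_duplicates helpers; MD5 over fixed-size chunks, with the usual caveat that matching hashes should still be confirmed byte-by-byte if collisions matter):

import hashlib
import os
from collections import defaultdict

CHUNK_SIZE = 64 * 1024  # hash files in 64 KiB chunks

def chunked_md5(path):
    # hash a file without loading it into memory all at once
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b''):
            h.update(chunk)
    return h.hexdigest()

def find_candidate_duplicates(paths):
    # step 1: group by size; a file with a unique size cannot have a duplicate
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    # step 2: within each size group, group by chunked hash
    by_hash = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        for p in group:
            by_hash[(size, chunked_md5(p))].append(p)
    return [g for g in by_hash.values() if len(g) > 1]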

Vinko Vrsalovic