Yes, hashing the files is the best approach if you have to compare several files or store the hashes for later comparison. Since hashes can collide, a follow-up byte-by-byte comparison may still be warranted, depending on the use case.
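If you do want to store hashes for later comparison, something along these lines works; file_hash is just an illustrative helper, and reading in chunks avoids loading whole files into memory:

import hashlib

def file_hash(path, chunk_size=8192):
    # read the file in chunks so large files don't have to fit in memory
    h = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

# store the digest, then later compare digests instead of re-reading both files
# (equal digests still deserve a byte-by-byte check if collisions matter to you)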
For a one-off check, a byte-by-byte comparison is generally sufficient and efficient, and the filecmp module already does this (plus a few other things).
See http://docs.python.org/library/filecmp.html
e.g.
>>> import filecmp
>>> filecmp.cmp('file1.txt', 'file1.txt')
True
>>> filecmp.cmp('file1.txt', 'file2.txt')
False
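Note that by default filecmp.cmp() does a "shallow" comparison: files whose os.stat() signatures (file type, size, modification time) match are reported equal without their contents being read. Pass shallow=False to force a content comparison:
>>> filecmp.cmp('file1.txt', 'file2.txt', shallow=False)
False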
Speed consideration:
If only two files have to be compared, hashing both and then comparing the hashes is usually slower than a straightforward byte-by-byte comparison, provided the comparison is done efficiently. The code below gives a rough timing of hash vs. byte-by-byte comparison (see the file-based sketch after the output for how the same idea applies to files on disk).
Disclaimer: this is not the best way of timing or comparing the two algorithms, and it could be improved, but it gives a rough idea. If you think it should be improved, tell me and I will change it.
import random
import string
import hashlib
import time

def getRandText(N):
    # build a random printable string of length N
    return "".join([random.choice(string.printable) for i in xrange(N)])

N = 1000000
randText1 = getRandText(N)
randText2 = getRandText(N)

def cmpHash(text1, text2):
    # hash both strings and compare the digests
    hash1 = hashlib.md5()
    hash1.update(text1)
    hash1 = hash1.hexdigest()

    hash2 = hashlib.md5()
    hash2.update(text2)
    hash2 = hash2.hexdigest()

    return hash1 == hash2

def cmpByteByByte(text1, text2):
    # plain equality check; compares the strings byte by byte
    return text1 == text2

for cmpFunc in (cmpHash, cmpByteByByte):
    st = time.time()
    for i in range(10):
        cmpFunc(randText1, randText2)
    print cmpFunc.func_name, time.time() - st
and the output is
cmpHash 0.234999895096
cmpByteByByte 0.0
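The timing above compares strings already in memory; for files on disk, the same byte-by-byte idea can be applied in chunks so that neither file has to be read completely into memory. A rough sketch (files_equal is just an illustrative name, not what filecmp does internally):

def files_equal(path1, path2, chunk_size=8192):
    # compare two files chunk by chunk, stopping at the first difference
    with open(path1, 'rb') as f1, open(path2, 'rb') as f2:
        while True:
            chunk1 = f1.read(chunk_size)
            chunk2 = f2.read(chunk_size)
            if chunk1 != chunk2:
                return False
            if not chunk1:
                # both files ended at the same point without a mismatch
                return True

Checking os.path.getsize() on both files first lets you return False without reading anything when the sizes differ.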