Is making a system call to "md5sum file1" and "md5sum file2" and comparing the two outputs enough in this case?
Well, that will tell you whether they're definitely different or probably the same. It's possible for two files to have the same hash but not actually have the same data... just very unlikely.
In your situation, what is the impact if you get a false positive (i.e. if you think they're the same, but they're not)? MD5 is probably good enough not to worry about collisions if they would only occur accidentally... but if you've got security (or money) at stake and someone could plant a "bad" file with the same hash as a "good" file, you shouldn't rely on it.
Personally, I'd probably just read both files, comparing each byte - for a one-off comparison, both the hashing approach and this one require reading the whole file when they're equal; as Daniel points out in the comments, doing a byte-by-byte comparison lets you exit early as soon as you see a difference. Comparing the file sizes first is another quick optimization :)
The general advantage of hashing comes when you store the hash of the existing file somewhere, so that next time you only need to read and hash the new file.
If you're on a system with md5sum, that's probably good enough.
You can do it with the Python standard library -- check out hashlib.
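For example, a minimal sketch of hashing a file with hashlib (the helper name and chunk size are just illustrative):

import hashlib

def file_md5(path, chunk_size=8192):
    # Read in chunks so large files don't need to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

same = file_md5("file1") == file_md5("file2")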
Depends on whether you feel comfortable with the probability of a collision in the MD5 algorithm. Just note that a collision is highly unlikely: so yes, go ahead.
If you want to do more than just detect whether they differ, or don't trust the hashing solution, there are modules called difflib and filecmp that don't rely on external programs.
Of course there is a simple test that you should do before comparing the file content at all - if the files are different sizes, then they cannot possibly be the same.
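Putting those two ideas together, a rough sketch (the function name is a placeholder):

import os
import filecmp

def same_file(path_a, path_b):
    # Different sizes can never be the same content, so check that first.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    # shallow=False makes filecmp compare the actual contents.
    return filecmp.cmp(path_a, path_b, shallow=False)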
Wouldn't it be more efficient to simply read each file and do a byte-by-byte comparison, avoiding the hashing algorithm altogether? This avoids the (very unlikely) chance that two different files produce the same MD5 hash. Furthermore, you can bail out of the comparison as soon as the first difference is detected, which for very different files will be very early in the comparison (possibly on the first byte!)
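Something along those lines might look like this (chunked rather than literally one byte at a time; the chunk size is an arbitrary choice):

def files_equal(path_a, path_b, chunk_size=8192):
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False  # bail out at the first difference
            if not a:
                return True   # both files ended at the same point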
If there is nobody maliciously trying to create collisions, then you would have to compare about 2^64 files before you would expect to see a collision by random chance. However, it is possible for someone to carefully construct two files with the same MD5 sum due to cryptographic weaknesses in MD5. Whether the cryptographic weaknesses of MD5 matter or not depends on your application, where the files come from, and what an attacker could stand to gain if he tricked your program into thinking two different files were identical. MD5 is still a very good checksum, just not so great as a cryptographic hash.
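To see where "about 2^64" comes from: for a 128-bit digest, the birthday bound puts the expected number of files before a random collision at roughly the square root of the digest space. A quick check:

import math

expected = math.isqrt(2 ** 128)  # square root of the number of possible 128-bit digests
print(expected == 2 ** 64)       # True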
A hash is useful if you are going to cache it (to compare many different files with each other). If you just want to compare two files, it's a monstrous waste of cycles. After all - both files will be read in, and a lot of processing will be used on every byte.
If it's a 1:1 compare, just use:
import filecmp
# shallow=False forces a byte-for-byte comparison instead of just comparing os.stat() info
filecmp.cmp(file_name_1, file_name_2, shallow=False)
On the other hand, a good hash is the only way to compare a large number of files with each other.
SHA-1 and MD5 are sort of broken - but not for normal files. Researchers can generate two contrived files that collide, but it's very unlikely anyone can produce a collision against an existing file they didn't construct themselves.
git uses SHA-1 to identify content, so it's not a terrible choice.
The following will all work:
import hashlib

# your_text_here must be bytes, e.g. b"..." or some_string.encode("utf-8")
hash = hashlib.md5(your_text_here).hexdigest()    # safe*
hash = hashlib.sha1(your_text_here).hexdigest()   # safe*
hash = hashlib.sha224(your_text_here).hexdigest() # safe
hash = hashlib.sha512(your_text_here).hexdigest() # paranoid
# now put the hash in a dictionary (or database) for your many-to-many comparison.

# * Meaningful files will not collide by accident. Contrived files can be generated
#   which collide, but it's difficult to do.
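For the many-to-many comparison, the dictionary might look something like this (the file names are placeholders):

import hashlib

digests = {}
for path in ["a.txt", "b.txt", "c.txt"]:  # placeholder file names
    with open(path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()
    digests.setdefault(digest, []).append(path)  # group files with the same hash

# any digest with more than one path is a set of (almost certainly) identical files
duplicates = [paths for paths in digests.values() if len(paths) > 1]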