ansaurus

Question

What algorithm should be used when doing filechecksums to find dupes?

Answer 1

+1 A:

SHA1 is slightly better as a checksum than MD5. It is what Git uses.

slebetman 2010-01-03 02:11:21

Answer 2

+1 A:

MD5 has known vulnerabilities at this point, but that may not be a problem for your application. It's still reasonably good for distinguishing piles of bits. If something comes up with no match, then you know you haven't already seen it, since the algorithm is deterministic. If something comes back as a match, you should actually compare it to the blob that it ostensibly matched before acting as if it's really a duplicate. MD5 is relatively fast, but if you can't afford full-text comparisons on hash collisions, you should probably use a stronger hash, like SHA-256.

Novelocrat 2010-01-03 02:12:11

Answer 3

+5 A:

In this particular case, choice of algorithm probably isn't that significant. The key reasons for using SHA1 over MD5 all relate to creating cryptographically secure signatures.

MD5 should be perfectly acceptable for this task, as you probably don't need to worry about people maliciously crafting files to generate false duplicates.

Adam Luchjenbroers 2010-01-03 02:12:13

Answer 4

A:

While MD5 does have a few collisions, I've always used it for files and it's worked just fine.

dmitrig01 2010-01-03 02:12:30

Answer 5

+1 A:

For the describe purpose there is no real preferable solution, both hash-functions will solve the problem. Anyway, MD5 will usually be slightly faster than SHA1.

Example in python:

#!/usr/bin/env python

import hashlib, cProfile

def repeat(f, loops=10000000):
    def wrapper(): 
        for i in range(loops): f()
    return wrapper

@repeat
def test_md5():
    md5 = hashlib.md5(); md5.update("hello"); md5.hexdigest()

@repeat 
def test_sha1():
    sha = hashlib.sha1(); sha.update("hello"); sha.hexdigest()

cProfile.run('test_md5()')
cProfile.run('test_sha1()')

#
#         40000004 function calls in 59.841 CPU seconds
# 
# ....
#
#         40000004 function calls in 65.346 CPU seconds
# 
# ....

The MYYN 2010-01-03 02:27:03

Answer 6

+1 A:

What you are talking about is a checksum, which is related to (but not the same) as a cryptographic hash.

Yes, both MD5 and even CRC work just fine as checksums, as long as you are not concerned with a malicious user intentionally crafting two different files with the same checksum. If that is a concern, use SHA1 or, even better, some cryptographically unbroken hash.

BlueRaja - Danny Pflughoeft 2010-01-03 02:33:36

Answer 7

+1 A:

If you care about performances I think it would be better to check for matching file size first, then using a fast hash function (CRC32 or MD5 which should be faster than SHA1) and for possible duplicated files found this way trying with MD5, SHA1 or SHA256 (depending on the criticality of the task).

Andrea Zilio 2010-01-03 02:49:52

Answer 8

A:

We use MD5 at my work for exactly what you're considering. Works great. We only need to detect duplicates uploads on a per-customer basis, which reduces our exposure to the birthday problem, but md5 would still be sufficient for us if we had to detect duplicates across all uploads rather than per customer. If you can believe the internet, the probability p of a collision given n samples and a hash size of b is bounded by:

p <= n (n - 1) / (2 * 2 ^ b)

A few years back I ran this calculation for n = 10^9 and b = 128 and came up with p <= 1.469E-21. To put that in perspective, 10^9 files is one per second for 32 years. So we don't compare files in the event of a collision. If md5 says the uploads were the same, they're the same.

Wayne Conrad 2010-01-03 06:17:20

ansaurus

tags:

views:

answers:

What algorithm should be used when doing filechecksums to find dupes?

related questions