I have many, many files to be uploaded to the server, and I just want a way to avoid duplicates.

Thus, generating a small, unique key value from a big string seemed like exactly what a checksum was intended to do, and hashing seemed like the evolution of that.

So I was going to use an MD5 hash to do this. But then I read somewhere that "MD5 is not meant to be a unique key" and I thought that was really weird.

What's the right way of doing this?

edit: by the way, I drew on two sources to get to the following, which is how I'm currently doing it, and it's working just fine with Python 2.5:

import hashlib

def md5_from_file(fileName, block_size=2**14):
    """Return the MD5 hexdigest of a file, reading it in block_size chunks."""
    md5 = hashlib.md5()
    f = open(fileName, 'rb')  # binary mode, so it also works for images/videos on Windows
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    f.close()
    return md5.hexdigest()
A: 

Hint: Think of how a hash table works.

kwatford
You're right, but he's not going to get it.
Ignacio Vazquez-Abrams
Oh dear, it looks like this is another question without an answer...
Cawas
+2  A: 

The issue with MD5 is that it's broken. For most common uses there's little problem, and people still use both MD5 and SHA1, but I think that if you need a hashing function then you need a strong hashing function. To the best of my knowledge there is still no standard substitute for either of these. There are a number of algorithms that are "supposed" to be strong, but we have the most experience with SHA1 and MD5. That is, we (think we) know when these two break, whereas we don't really know as much about when the newer algorithms break.

Bottom line: think about the risks. If you wish to go the extra mile, you might add extra checks when you find a hash duplicate, at the price of a performance penalty.
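For example, the extra check could be a plain byte-by-byte comparison that only runs when two digests match. A rough sketch, reusing the md5_from_file() function from the question and filecmp from the standard library (the is_duplicate name and the path arguments are just placeholders):

import filecmp

def is_duplicate(new_file_path, existing_file_path):
    # Cheap check first: if the digests differ, the files differ.
    if md5_from_file(new_file_path) != md5_from_file(existing_file_path):
        return False
    # Same digest: confirm with a byte-by-byte comparison before
    # treating the two files as duplicates.
    return filecmp.cmp(new_file_path, existing_file_path, shallow=False)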

wilhelmtell
In this case, the strength of the hashing function is immaterial. MD5 will absolutely prevent duplicates to a virtual mathematical certainty.
Chase Seibert
What does "strength of the hashing function is immaterial" mean? The current attacks against MD5 let you generate collisions in a second on a single CPU -- so no, MD5 will not prevent "duplicates"
intgr
As has been said elsewhere, MD5 does not _prevent_ duplicates/collisions, though it does make them rather unlikely. Also, MD5 is only "broken" in the sense that it's cryptographically insecure - a determined attacker could create a collision if desired. For the purposes of the original question, though, cryptographic security isn't necessary, so that's not a valid reason for dismissing MD5.
David Zaslavsky
A hash function that generates a fixed-length hash is _inherently_ prone to collisions. It's just that we don't want to ever find two distinct texts that hash to the same value. We know the collisions are there, we just don't want to ever find them. Much worse, there's a problem when given _any arbitrary_ text we can generate another _distinct_ text that hashes the same; or when given a hash we can generate a text that hashes to the same hash value. How serious this is for the OP only the OP knows: I think it's a matter of risk assessment.
wilhelmtell
+3  A: 

The issue with hashing is that it generates a "small" identifier from a "large" dataset. It's like lossy compression. While you can't guarantee uniqueness, you can use it to substantially limit the number of other items you need to compare against.

Consider that MD5 yields a 128 bit value (I think that's what it is, although the exact number of bits is irrelevant). If your input data set has 129 bits and you actually use them all, each MD5 value will appear on average twice. For longer datasets (e.g. "all text files of exactly 1024 printable characters") you're still going to run into collisions once you get enough inputs. Contrary to what another answer said, it is a mathematical certainty that you will get collisions.

See http://en.wikipedia.org/wiki/Birthday_Paradox

Granted, you have around a 1% chance of collisions with a 128 bit hash at 2.6*10^18 entries, but it's better to handle the case that you do get collisions than to hope that you never will.
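To see where that roughly-1% figure comes from, the usual birthday-bound approximation is p ~= 1 - exp(-n^2 / (2 * 2^bits)). A quick sketch that reproduces the estimate (the function name is just illustrative; the 2.6*10^18 count is the one quoted above):

import math

def collision_probability(n, bits=128):
    # Birthday approximation: p ~= 1 - exp(-n^2 / (2 * 2**bits))
    d = 2.0 ** bits
    return 1.0 - math.exp(-(float(n) ** 2) / (2.0 * d))

print collision_probability(2.6e18)  # roughly 0.0099, i.e. about 1%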

dash-tom-bang
+3  A: 

Sticking with MD5 is a good idea. Just to make sure, I'd append the file length or the number of chunks to your file-hash table.

Yes, there is the possibility that you run into two files that have the same MD5 hash, but that's quite unlikely (if your files are of decent size). Thus, adding the number of chunks to your hash may help reduce that, since now you have to find two files of the same size with the same MD5.

def store_file(file, nchunks, hash):
    """ Tells you whether there is another file with the same contents already,
        by making a table lookup """
    # This can be a DB lookup or some way to obtain your hash map
    big_table = ObtainTable()

    # A two-level lookup table might help performance;
    # that will vary with the number of entries and the nature of big_table
    if nchunks in big_table:
        if hash in big_table[nchunks]:
            raise DuplicateFileException, \
                'File is dup with %s' % big_table[nchunks][hash]
    else:
        big_table[nchunks] = {}

    big_table[nchunks].update({
        hash: file.filename
    })

    file.save() # or something

# hash_for_file() is the algorithm you described, but it also returns the
# number of chunks; ObtainTable() and DuplicateFileException stand in for
# your own storage and error handling.
new_file_hash, nchunks = hash_for_file(new_file)
store_file(new_file, nchunks, new_file_hash)

To reduce that possibility, switch to SHA1 and use the same method. Or even use both (concatenated) if performance is not an issue.
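If performance does allow using both, a single pass over the file can feed the same blocks to MD5 and SHA1 and join the two hexdigests into one key. A sketch along the lines of the hashing function in the question (combined_digest is just an illustrative name):

import hashlib

def combined_digest(file_name, block_size=2**14):
    # Feed the same data to both hash objects in a single pass over the file.
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    f = open(file_name, 'rb')
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
        sha1.update(data)
    f.close()
    # 32 + 40 hex characters; a collision would have to hit both hashes at once.
    return md5.hexdigest() + sha1.hexdigest()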

Of course, keep in mind that this will only catch files that are duplicates at the binary level, not images, sounds, or videos that are "the same" but have different signatures.

Jj
Well, my case is actually about big images and large videos, and performance is quite an issue. But yeah, I'm not expecting it to detect two slightly different angles of the same scene, for instance.
Cawas
This is definitely the best answer. If the OP wants better than SHA1 though, rather than concatenating, he should just use SHA2.
Carson Myers
Adding more data onto the hash just changes your hash function (e.g. this answer says "append some other value onto what's returned from MD5 to generate a longer hash"). Sometimes this is the easiest option, but you could also just generate a longer hash in the first place. Alas, longer hashes are not a prevention against collisions.
dash-tom-bang
#dash how do I generate a longer hash? I could see nothing pointing to that in the docs about hashlib.
Cawas
oops, I meant @dash
Cawas
You can trivially get a longer hash by sticking the results of two (independent) hash functions one after the other. E.g. if you take the MD5 return value and then concatenate the file length onto it, you have a longer hash. File length isn't a good hashing algorithm, but it is useful as an example. I don't know anything about any public hashing APIs, though, so I don't know what you would/could feed them, although I suspect that MD5'ing the data backwards would give results uncorrelated enough that concatenating the forward and backward hashes would be double strength.
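A minimal sketch of that composite-key idea, using the file length (md5_from_file() is the function from the question; the key format with a "-" separator is just illustrative):

import os

def md5_plus_length(file_name):
    # Composite key: the MD5 hexdigest with the file size in bytes appended.
    # Two files only look like duplicates if both parts match.
    return "%s-%d" % (md5_from_file(file_name), os.path.getsize(file_name))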
dash-tom-bang