views:

1155

answers:

8

I've got an image library on Amazon S3. For each image, I md5 the source URL on my server plus a timestamp to get a unique filename. Since S3 can't have subdirectories, I need to store all of these images in a single flat folder.

Do I need to worry about collisions in the MD5 hash value that gets produced?

Bonus: How many files could I have before I'd start seeing collisions in the hash value that MD5 produces?

A: 

Doesn't really matter how likely it is; it is possible. It could happen on the first two things you hash (very unlikely, but possible), so you'll need to support collisions from the beginning.

Karg
There may of course be many other bad things which can happen with a probability of 1/2^128. You might not want to single-out this one to worry about.
Will Dean
The worst thing that can happen here is you can get a photo. For a relatively small number I would not worry. Now if your software is controlling an autopilot landing an aircraft, thats another story.
Jim C
You can't be serious. You'll need to hash 6 billion files per second, every second for 100 years to get good chance of collision. Even if you're very very unlucky, it would probably take more than entire capacity of S3 used for longer than a human lifetime.
porneL
It's billions of times more likely that your database and its backups will all fail. Collisions are not worth worrying about.
Artelius
+4  A: 

A rough rule of thumb for collisions is the square-root of the range of values. Your MD5 sig is presumably 128 bits long, so you're going to be likely to see collisions above and beyond 2^64 images.

Will Dean
You probably mean 128 bits, not 2^128. :-)
JesperE
http://en.wikipedia.org/wiki/Birthday_ProblemSome more information about the problem.
Georg
+6  A: 

S3 can have subdirectories you know. Just put a / in the key name, and you can access the files as if they were in separate directories. I use this to store user files in separate folders based on their user ID in S3. Eg "mybucket/users/1234/somefile.jpg". It's not exactly the same as a directory in a filesystem, but the S3 API has some features that let it work almost the same. I can ask it to list all files that begin with "users/1234/" and it will show me all the files in that "directory"

davr
+2  A: 

So wait, is it:

md5(filename) + timestamp

or:

md5(filename + timestamp)

If the former, you are most of the way to a GUID, and I wouldn't worry about it. If the latter, then see Karg's post about how you will run into collisions eventually.

Ryan
+2  A: 

While there have been well publicized problems with MD5 due to collisions, UNINTENTIONAL collisions among random data are exceedingly rare. On the other hand, if you are hashing on the file name, that's not random data, and I would expect collisions quickly.

acrosman
The only problem I have with taylors example is that if someone gets a copy of your database they could probably figure out the credit card numbers using a rainbow table ...
Sam Saffron
While I wouldn't choose to use MD5 for credit cards, a Rainbow table of all valid credit card numbers between 10,000,000 (8 digits being the smallest length credit card I've seen) and 9,999,999,999,999,999 (largest 16 digit number) is still a big table to generate. There are probably easier ways to steal those numbers.
acrosman
A: 

The wikipedia entry on MD5 is pretty helpful here: http://en.wikipedia.org/wiki/MD5

warren
+18  A: 

Because of birthday paradox probability of collision is 1/264. This means that to get a collision, on average, you'll need to hash 6 billion files per second for 100 years.

porneL
+1 for adding the calculation. This is slightly more accurate: `http://www.google.com/search?q=2^64%2F100*(seconds+per+year)`
Mathias Bynens
+1  A: 

Although random MD5 collisions are exceedingly rare, if your users can provide files (that will be stored verbatim) then they can engineer collisions to occur. That is, they can deliberately create two files with the same MD5sum but different data. Make sure your application can handle this case in a sensible way, or perhaps use a stronger hash like SHA-256.

bdonlan