I want to generate a unique filename for each image, so I'm using MD5 to make the filenames. Since two copies of the same image could come from different locations, I'd like to base the hash on the image contents rather than on the source. What caveats does this present?

(doing this with PHP5 for what it's worth)
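A minimal sketch of the idea in the question, written in Python for illustration (in PHP, `md5_file()` and `sha1_file()` hash a file's contents directly). The function name `content_filename` and its parameters are hypothetical, not from the original post.

```python
import hashlib


def content_filename(path, extension, algorithm="md5", chunk_size=65536):
    """Return a filename built from a hash of the file's bytes.

    Hypothetical helper: the name and parameters are assumptions
    made for this example.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so a large image is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() + extension
```

Two files with identical bytes get the same name regardless of where they came from; an MD5 name is 32 hex characters plus the extension.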

+1  A: 

Seems fine to me, if you're ok with 32-character filenames.

Edit: I wouldn't use this as the basis of (say) the FBI's central database of terrorist mugshots, since a sufficiently motivated attacker could probably come up with an image that has the same MD5 as an existing one. If that is a concern, you could use SHA-1 instead, which is somewhat more secure.

+3  A: 

It's a good approach. There is an extremely small possibility that two different images might hash to the same value, but in reality your data center has a greater probability of suffering a direct hit by an asteroid.

One caveat is that you should be careful when deleting images. If you delete an image record that points to some file and you delete the file too, then you may be deleting a file that has a different record pointing to the same image (that belongs to a different user, say).
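One way to handle the deletion caveat is to count how many records point at each file and remove the file only when the last one goes away. The sketch below keeps the counts in memory purely for illustration; the class `ImageStore` and its methods are hypothetical, and a real site would presumably keep the counts in its database.

```python
import os


class ImageStore:
    """Hypothetical bookkeeping for files shared by several records."""

    def __init__(self, directory):
        self.directory = directory
        self.ref_counts = {}  # filename -> number of records pointing at it

    def add_reference(self, filename):
        self.ref_counts[filename] = self.ref_counts.get(filename, 0) + 1

    def remove_reference(self, filename):
        """Drop one reference; return True when the last one was dropped."""
        remaining = self.ref_counts.get(filename, 0) - 1
        if remaining > 0:
            # Another record still uses this file, so keep it on disk.
            self.ref_counts[filename] = remaining
            return False
        self.ref_counts.pop(filename, None)
        path = os.path.join(self.directory, filename)
        if os.path.exists(path):
            os.remove(path)
        return True
```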

Greg Hewgill
A: 

If you have two identical images loaded from different places, say a stock photo, then you could end up overwriting the 'original'. However, since the contents are identical, that just means you're storing one copy instead of two.
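If overwriting is a worry, the write can be skipped when a file with that hash already exists. A rough sketch, assuming the image bytes are already in memory; `store_image` is a hypothetical name, and MD5 is used only because the question does.

```python
import hashlib
import os


def store_image(data, directory, extension=".jpg"):
    """Save image bytes under a content-derived name.

    Returns (filename, was_new). Hypothetical helper for illustration.
    """
    filename = hashlib.md5(data).hexdigest() + extension
    path = os.path.join(directory, filename)
    try:
        # Mode "xb" raises FileExistsError if the file is already there,
        # so an earlier copy is never overwritten.
        with open(path, "xb") as handle:
            handle.write(data)
        return filename, True
    except FileExistsError:
        return filename, False
```

The second upload of the same bytes returns the same filename with `was_new` set to `False`, which is also a convenient point to record a second reference to the file.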

With that being said, I don't see any big issues with doing it in the way you described.

warren
A: 

It will be time consuming. Why don't you just assign them sequential ids?

Paul Tomblin
Because if two people upload the same image, I don't want to store it twice.
Ben Throop
+1  A: 

You could use a UUID instead?

johnstok
A: 

You might want to look into the technology P2P networks use to identify duplicate files. A solution involving MD5, SHA-1, and file length would be pretty reliable (and probably overkill).
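As a sketch of that combination, the two digests and the length can be joined into one identifier. The key layout and the name `composite_key` are assumptions for this example, not a format any P2P network is known to use.

```python
import hashlib


def composite_key(data):
    """Combine MD5, SHA-1, and byte length into one identifier.

    Hypothetical layout: '<md5 hex>-<sha1 hex>-<length>'.
    """
    md5_hex = hashlib.md5(data).hexdigest()    # 32 hex characters
    sha1_hex = hashlib.sha1(data).hexdigest()  # 40 hex characters
    return "{}-{}-{}".format(md5_hex, sha1_hex, len(data))
```

Two files are treated as the same only if both digests and the length all match.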

Draemon
+3  A: 

Given completely random file contents and a good cryptographic hash, the probability that two files share a hash value reaches roughly 50% when the number of files is about 2^(n/2), where n is the number of bits in the hash. That is, for a 128-bit hash there is roughly a 50% chance of at least one collision once the number of files reaches about 2^64.

Your file contents are decidedly not random, but I have no idea how strongly that influences the probability of collision. This is known as the birthday problem (the basis of the birthday attack), if you want to google for more.

It is a probabilistic game. If the number of images will be substantially less than 2^64, you're probably fine. If you're still concerned, using a combination of SHA-1 plus MD5 (as another answer suggested) gets you to a total of 288 high-quality hash bits, which means you'll have a 50% chance of a collision once there are 2^144 files. 2^144 is a mighty big number. Mighty big. One might even say huge.
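To put numbers on this, here is a small sketch using the standard birthday approximation p ≈ 1 - exp(-k^2 / 2^(n+1)) for k files and an n-bit hash, assuming the hash behaves like a uniformly random function. The function names are made up for this example.

```python
import math


def collision_probability(num_files, hash_bits):
    """Approximate chance of at least one collision (birthday approximation)."""
    exponent = -(num_files ** 2) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


def files_for_even_odds(hash_bits):
    """Approximate number of files at which a collision becomes 50% likely."""
    return math.sqrt(2 * math.log(2)) * 2 ** (hash_bits / 2)


if __name__ == "__main__":
    print(collision_probability(2 ** 64, 128))  # chance at exactly 2^64 files
    print(collision_probability(10 ** 9, 128))  # chance with a billion images
    print(files_for_even_odds(128) / 2 ** 64)   # multiple of 2^64 for 50%
```

Under this approximation the chance at exactly 2^64 files is about 39%, and 50% is reached at about 1.18 × 2^64, which agrees with the "roughly 2^64" rule of thumb. With a billion images and a 128-bit hash, the chance is on the order of 10^-21.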

DGentry
+2  A: 

You should use SHA-1 instead of MD5, because MD5 is broken. There are pairs of different files with the same MD5 hash (this is not theoretical; such pairs are actually known, and there are algorithms to generate more of them). For your application, this means someone could upload two different images that have the same MD5 hash, or someone could generate such a pair of images and publish them somewhere on the Internet, so that two of your users later try to upload them, with confusing results.

CesarB