views: 537

answers: 9
I would like to prevent duplicate content. I do not want to keep copies of the content, so I decided to keep just the MD5 signatures.

I read that MD5 collisions do happen: different content could result in the same MD5 signature.

Do you think md5 is enough?

Should I use MD5 and SHA-1 together?

+3  A: 

MD5 should be fine; collisions are very rare. But if you're really worried, you can use SHA-1 as well.

Though I guess the signatures really aren't that large, so if you have the spare processing cycles and the disk space, you could do both. But if space or speed is limited, I'd just go with one.

samoz
MD5 is not fine because it's now easy to intentionally collide.
Steven Sudit
So? Legitimate content is posted first -> maluser creates intentional collision -> maluser is denied ability to post content.
Spencer Ruport
Try that again with legitimate content posted second.
Steven Sudit
Maluser creates intentional collision for content that hasn't been posted (<-Magic???) -> Legitimate user tries to post content -> Legitimate user is denied ability to post content.
Spencer Ruport
Do the malusers know what's going to be posted? :) I think there'd be a much bigger issue if that were the case.
Spencer Ruport
Maluser downloads content from the canonical source and crafts replacement with an identical MD5 digest, uploading it before the legitimate version.
Steven Sudit
Spencer, we don't know from the question whether the material being posted is original. In fact, we don't know much of anything about the security context yet. I'm going to wait for the OP to explain.
Steven Sudit
... okay when he says content I'm thinking Web 2.0 content (e.g. User generated, e.g. this comment) so the only canonical source would be the user's brain. Maybe we're not on the same page here.
Spencer Ruport
I mean, hashes are basically just large numbers. I'm thinking of a number right now. Try to collide with it.
Spencer Ruport
@Spencer: I bet the answer is 42
Treb
@Spencer: Our apparent disagreement here stems from different assumptions about what the OP meant by "content", so we're definitely on two different pages. If it's user-generated, then I agree that even a non-cryptographic 64-bit hash would be adequate for pretty much any realistic context. In fact, I recently recommended exactly this (http://stackoverflow.com/questions/1096558/uniquely-identifying-urls-with-one-64-bit-number).
Steven Sudit
Just add a salt to the file hash, as in this question: http://stackoverflow.com/questions/878837/salting-a-c-md5-computehash-on-a-stream
Simeon Pilgrim
I'm all for salt, but I don't think it would work here. If you used a different salt for each piece of content, then you couldn't detect duplicates. And if you used the same salt and it became known, then it would offer no defense. Essentially, all the salt would do is slightly obscure the hashing algorithm.
Steven Sudit
A: 

MD5 is broken and SHA1 is close to it. Use SHA2.

edit

Based on an update from the OP, it doesn't seem that intentional collisions are a serious concern here. For unintentional ones, any decent hash with at least a 64-bit output would be fine.

I would still avoid MD5 and even SHA1, in general, but there's no reason to be dogmatic about it. If the tool fits here, then by all means use it.
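As a rough illustration of keeping only signatures, here is a minimal Python sketch that assumes the content is available as bytes and uses SHA-256 from the standard `hashlib` module. The class name `DigestIndex` and the method `add_if_new` are made up for this example, and the in-memory set stands in for whatever storage is actually used.

```python
import hashlib


class DigestIndex:
    """Remembers only the digests of content seen so far, not the content."""

    def __init__(self):
        # Hypothetical in-memory store; a real system would persist this.
        self._seen = set()

    def add_if_new(self, content: bytes) -> bool:
        """Return True and record the digest if it is new, else return False."""
        digest = hashlib.sha256(content).digest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


index = DigestIndex()
print(index.add_if_new(b"first post"))   # new content
print(index.add_if_new(b"first post"))   # same bytes again
```

Swapping `hashlib.sha256` for another `hashlib` constructor changes only the digest size, not the logic.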

Steven Sudit
Could you link to some evidence that SHA-1 is broken like MD5?
Bob Somers
I think that in this case, SHA1 is overkill. He isn't trying to secure anything (like passwords), but prevent duplicates. MD5 is fine for this purpose.
Thomas Owens
It's close to it, but not yet broken. See http://en.wikipedia.org/wiki/SHA_hash_functions#SHA-1
Steven Sudit
@Thomas, the SHA2 family comes with a variety of digest sizes, which allows tuning to avoid the birthday paradox. Even SHA-256 produces 256 bits, as opposed to MD5's 128. I also don't know enough about the context, but if intentional collisions are a risk then MD5 is definitely ruled out.
Steven Sudit
But when preventing duplicates, the worst that would happen with an intentional collision is that something that isn't a duplicate would be marked as a duplicate. I think the smaller digest size and the rarity of collisions are well worth it.
Thomas Owens
@Thomas Owens: Indeed. Depending on the security context, this is either no big deal or a catastrophe. I would very much like to know more about the context.
Steven Sudit
A: 

A timestamp + md5 together are safe enough.

Stiropor
It really depends on the content; e.g. you wouldn't want to use timestamp for an image, because two images may be otherwise byte for byte duplicates, yet have different create/modified timestamps.
pdwetz
+1  A: 

md5 should be enough. Yes, there can be collisions, but the chances of that happening are so incredibly small that I wouldn't worry about it unless you were literally tracking many billions of pieces of content.
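To put a rough number on "incredibly small", here is a sketch of the usual birthday-bound approximation, p ≈ 1 - exp(-n(n-1) / 2^(b+1)), for n items and a b-bit digest. It assumes the hash output behaves like uniformly random bits, so it covers accidental collisions only; the function name is invented for the example.

```python
import math


def collision_probability(n_items: int, digest_bits: int) -> float:
    """Approximate chance that at least two of n_items share a digest.

    Assumes digests are uniformly distributed (accidental collisions only).
    """
    exponent = -n_items * (n_items - 1) / 2 ** (digest_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


# A billion items with a 128-bit digest (MD5-sized) versus a 64-bit digest.
print(collision_probability(10**9, 128))
print(collision_probability(10**9, 64))
```

With a 128-bit digest the estimate stays negligible at a billion items, while a 64-bit digest at the same scale is already a noticeable risk.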

Eric Petroelje
+5  A: 

People have been able to deliberately produce MD5 collisions under contrived circumstances, but for preventing duplicate content (in the absence of malicious users) it's more than adequate.

Having said that, if you can use SHA-1 (or SHA-2) you should - you'll be fractionally but measurably safer from collisions.

RichieHindle
Even in the presence of malusers this should be fine, shouldn't it? Legitimate content is posted first -> maluser creates intentional collision -> maluser is denied ability to post content.
Spencer Ruport
(The comment sequence on samoz's answer here: http://stackoverflow.com/questions/1121701/can-i-preventing-duplicate-content-using-md5#1121713 addresses Spencer's comment.)
RichieHindle
A: 

If you're really afraid of accidental collisions just do both MD5 and SHA1 hashes and compare them. If they both match, it's the same content. If either one differs, it's different content.
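A minimal sketch of that comparison in Python, assuming the content is available as bytes. The helper names `dual_signature` and `same_content` are made up for illustration, and `hashlib.md5` may be disabled on some locked-down (for example FIPS-mode) Python builds.

```python
import hashlib


def dual_signature(content: bytes) -> tuple:
    """Return the (MD5, SHA-1) hex digests of the content as a pair."""
    return (
        hashlib.md5(content).hexdigest(),
        hashlib.sha1(content).hexdigest(),
    )


def same_content(sig_a: tuple, sig_b: tuple) -> bool:
    """Treat two items as duplicates only if both digests match."""
    return sig_a == sig_b


stored = dual_signature(b"some uploaded content")
incoming = dual_signature(b"some uploaded content")
print(same_content(stored, incoming))
```

The stored pair takes 16 + 20 bytes in raw form, so keeping both remains cheap compared with keeping the content.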

Spencer Ruport
+1  A: 

Why not simply compare the content byte for byte if there is a hash collision? Hash collisions are very rare, so you're only going to have to do a byte-for-byte check very rarely. That way, duplicates will only be detected if the items are actually duplicated.
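One possible shape for this, sketched in Python. Note the assumption: the original content must still be retrievable for the byte-for-byte check, which the question wanted to avoid, so this example simply keeps it in memory. `VerifyingStore` and `add_if_new` are hypothetical names.

```python
import hashlib
from collections import defaultdict


class VerifyingStore:
    """Uses the digest as a fast filter and confirms matches byte for byte."""

    def __init__(self):
        # Maps digest -> list of distinct contents sharing that digest.
        self._by_digest = defaultdict(list)

    def add_if_new(self, content: bytes) -> bool:
        """Return False only when identical bytes were stored before."""
        digest = hashlib.md5(content).digest()
        bucket = self._by_digest[digest]
        # The full comparison only runs when the digest already exists.
        if any(existing == content for existing in bucket):
            return False
        bucket.append(content)
        return True


store = VerifyingStore()
print(store.add_if_new(b"a picture"))
print(store.add_if_new(b"a picture"))
```

Because a digest match is confirmed against the actual bytes, a colliding but different item would be stored rather than rejected.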

Martin
+1  A: 

Combining algorithms serves only to obfuscate; it does not increase the security of a hashing algorithm.

MD5 is too broken to use anyway, IMHO. Researchers have demonstrated that they can forge content that generates an MD5 collision, which opened the door to generating a forged CSR to buy a cert from RapidSSL for a domain name they don't own. Security Now! episode 179 explains the process.

For me, SHA-based hashes are stronger and most development platforms support them, so the choice is easy. The remaining deciding factor is then the block size.

spoulson
I would only add that SHA-1 has shown signs of weakness, so SHA-2 is generally recommended.
Steven Sudit