Can I prevent duplicate content using md5?

tags:

hashing
md5

views:

537

+3 Q:

Can I prevent duplicate content using md5?

I would like to prevent duplicate content. I do not want to keep copies of the content, so I decided to keep just the md5 signatures.

I read that md5 collisions do happen: different content could result in the same md5 signature.

Do you think md5 is enough?

Should I use md5 and SHA-1 together?
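
The approach described in the question, keeping only the signatures and not the content, can be sketched roughly as below. This is a minimal illustration, not a definitive implementation: `seen_digests` and `is_duplicate` are hypothetical names, and the in-memory set stands in for whatever storage is actually used.

```python
import hashlib

# Hypothetical in-memory store of signatures; a real system would
# more likely keep these in a database column with a unique index.
seen_digests = set()


def is_duplicate(content: bytes) -> bool:
    """Return True if this exact byte sequence was seen before.

    If the content is new, its MD5 digest is recorded and False is returned.
    Only the digest is kept, never the content itself.
    """
    digest = hashlib.md5(content).hexdigest()
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False
```

Swapping `hashlib.md5` for `hashlib.sha256` changes nothing else in the sketch, which is why the choice of hash is mostly a question of digest size and threat model.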

+3 A:

MD5 should be fine; collisions are very rare, but if you're really worried, you can use SHA-1 as well.

Though I guess the signatures really aren't that large, so if you have the spare processing cycles and the disk space, you could do both. But if space or speed is limited, I'd just go with one.

samoz 2009-07-13 19:58:00

MD5 is not fine because it's now easy to intentionally collide.

Steven Sudit 2009-07-13 19:58:25

So? Legitimate content is posted first -> maluser creates intentional collision -> maluser is denied ability to post content.

Spencer Ruport 2009-07-13 20:00:19

Try that again with legitimate content posted second.

Steven Sudit 2009-07-13 20:02:22

Maluser creates intentional collision for content that hasn't been posted (<-Magic???) -> Legitimate user tries to post content -> Legitimate user is denied ability to post content.

Spencer Ruport 2009-07-13 20:06:38

Do the malusers know what's going to be posted? :) I think there'd be a much bigger issue if that were the case.

Spencer Ruport 2009-07-13 20:07:39

Maluser downloads content from the canonical source and crafts replacement with an identical MD5 digest, uploading it before the legitimate version.

Steven Sudit 2009-07-13 20:07:56

Spencer, we don't know from the question whether the material being posted is original. In fact, we don't know much of anything about the security context yet. I'm going to wait for the OP to explain.

Steven Sudit 2009-07-13 20:11:28

... okay when he says content I'm thinking Web 2.0 content (e.g. User generated, e.g. this comment) so the only canonical source would be the user's brain. Maybe we're not on the same page here.

Spencer Ruport 2009-07-13 20:11:44

I mean, hashes are basically just large numbers. I'm thinking of a number right now. Try to collide with it.

Spencer Ruport 2009-07-13 20:14:26

@Spencer: I bet the answer is 42

Treb 2009-07-13 20:16:07

@Spencer: Our apparent disagreement here stems from different assumptions about what the OP meant by "content", so we're definitely on two different pages. If it's user-generated, then I agree that even a non-cryptographic 64-bit hash would be adequate for pretty much any realistic context. In fact, I recently recommended exactly this (http://stackoverflow.com/questions/1096558/uniquely-identifying-urls-with-one-64-bit-number).

Steven Sudit 2009-07-13 20:28:38

Just add a salt to the file hash, like this question http://stackoverflow.com/questions/878837/salting-a-c-md5-computehash-on-a-stream

Simeon Pilgrim 2009-07-13 23:11:47

I'm all for salt, but I don't think it would work here. If you used a different salt for each piece of content, then you couldn't detect duplicates. And if you used the same salt and it became known, then it would offer no defense. Essentially, all the salt would do is slightly obscure the hashing algorithm.

Steven Sudit 2009-07-14 01:32:41
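
The point about salts and duplicate detection can be checked with a small sketch. The salt values and the `salted_digest` helper are made up for illustration; the only claim shown is that a shared salt keeps identical content comparable, while per-item salts do not.

```python
import hashlib


def salted_digest(content: bytes, salt: bytes) -> str:
    """Hash the salt followed by the content (hypothetical helper)."""
    return hashlib.md5(salt + content).hexdigest()


content = b"the same content, posted twice"

# One shared salt: identical content still yields identical digests,
# so duplicates remain detectable.
shared_salt = b"site-wide-salt"  # assumed value, for illustration only
first = salted_digest(content, shared_salt)
second = salted_digest(content, shared_salt)

# A different salt per item: identical content yields different digests,
# so the stored signatures can no longer reveal duplicates.
third = salted_digest(content, b"salt-for-item-1")
fourth = salted_digest(content, b"salt-for-item-2")

print(first == second)   # shared salt
print(third == fourth)   # per-item salts
```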

MD5 is broken and SHA1 is close to it. Use SHA2.

edit

Based on an update from the OP, it doesn't seem that intentional collisions are a serious concern here. For unintentional ones, any decent hash with at least a 64-bit output would be fine.

I would still avoid MD5 and even SHA1, in general, but there's no reason to be dogmatic about it. If the tool fits here, then by all means use it.

Steven Sudit 2009-07-13 19:58:02

Could you link to some evidence that SHA-1 is broken like MD5?

Bob Somers 2009-07-13 19:58:43

I think that in this case, SHA1 is overkill. He isn't trying to secure anything (like passwords), but prevent duplicates. MD5 is fine for this purpose.

Thomas Owens 2009-07-13 19:59:06

It's close to it, but not yet broken. See http://en.wikipedia.org/wiki/SHA_hash_functions#SHA-1

Steven Sudit 2009-07-13 20:03:30

@Thomas, the SHA2 family comes with a variety of digest sizes, which allows tuning to avoid the birthday paradox. Even the smallest is 224 bits (SHA-224), as opposed to MD5's 128. I also don't know enough about the context, but if intentional collisions are a risk then MD5 is definitely ruled out.

Steven Sudit 2009-07-13 20:05:52
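
For reference, the digest sizes being compared here can be read straight from Python's `hashlib`. This is only a quick sketch; `digest_bits` is a name chosen for the example.

```python
import hashlib

# Digest size in bits for MD5, SHA-1, and the SHA-2 variants in hashlib.
algorithms = ("md5", "sha1", "sha224", "sha256", "sha384", "sha512")
digest_bits = {name: hashlib.new(name).digest_size * 8 for name in algorithms}

for name, bits in digest_bits.items():
    print(f"{name}: {bits} bits")
```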

But when preventing duplicates, the worst that would happen with an intentional collision is that something that isn't a duplicate would be marked as a duplicate. I think the smaller digest size and rarity of collisions anyway is well worth it.

Thomas Owens 2009-07-13 20:25:48

@Thomas Owens: Indeed. Depending on the security context, this is either no big deal or a catastrophe. I would very much like to know more about the context.

Steven Sudit 2009-07-13 21:15:31

A timestamp + md5 together are safe enough.

Stiropor 2009-07-13 19:58:58

It really depends on the content; e.g. you wouldn't want to use timestamp for an image, because two images may be otherwise byte for byte duplicates, yet have different create/modified timestamps.

pdwetz 2009-07-13 20:02:15

+1 A:

md5 should be enough. Yes, there can be collisions, but the chances of that happening are so incredibly small that I wouldn't worry about it unless you were literally tracking many billions of pieces of content.

Eric Petroelje 2009-07-13 19:59:05
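
To put a rough number on "incredibly small", the usual birthday approximation can be used: for n items and a b-bit hash that behaves randomly, the chance of at least one accidental collision is about 1 - exp(-n(n-1) / 2^(b+1)). The sketch below assumes exactly that idealized model and says nothing about deliberate collisions; `collision_probability` is a hypothetical helper name.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_items share a digest.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2 * 2**bits)),
    assuming the hash output is uniformly random.
    """
    exponent = -n_items * (n_items - 1) / (2 * 2 ** hash_bits)
    # expm1(x) computes e**x - 1 accurately even when x is tiny.
    return -math.expm1(exponent)


for n in (10**6, 10**9, 10**10):
    print(n, collision_probability(n, 128), collision_probability(n, 64))
```

Under this model, ten billion items with a 128-bit digest give a probability on the order of 1e-19, while the same count with a 64-bit digest makes a collision likely.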

+5 A:

People have been able to deliberately produce MD5 collisions under contrived circumstances, but for preventing duplicate content (in the absence of malicious users) it's more than adequate.

Having said that, if you can use SHA-1 (or SHA-2) you should - you'll be fractionally but measurably safer from collisions.

RichieHindle 2009-07-13 19:59:46

Even in the presence of malusers this should be fine, shouldn't it? Legitimate content is posted first -> maluser creates intentional collision -> maluser is denied ability to post content

Spencer Ruport 2009-07-13 20:02:02

(The comment sequence on samoz's answer here: http://stackoverflow.com/questions/1121701/can-i-preventing-duplicate-content-using-md5#1121713 addresses Spencer's comment.)

RichieHindle 2009-07-13 23:19:41

+3 A:

This answers your question -

http://stackoverflow.com/questions/201705/how-many-random-elements-before-md5-produces-collisions

vs 2009-07-13 20:00:17

If you're really afraid of accidental collisions just do both MD5 and SHA1 hashes and compare them. If they both match, it's the same content. If either one differs, it's different content.

Spencer Ruport 2009-07-13 20:16:01
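
The "do both and compare" idea might look something like this. It is a sketch under the assumption that signatures are kept in an in-memory set; `combined_signature`, `seen_signatures` and `is_duplicate_both` are hypothetical names.

```python
import hashlib
from typing import Tuple

# Hypothetical store of (md5, sha1) pairs for everything seen so far.
seen_signatures = set()


def combined_signature(content: bytes) -> Tuple[str, str]:
    """Return the MD5 and SHA-1 hex digests of the content as a pair."""
    return (
        hashlib.md5(content).hexdigest(),
        hashlib.sha1(content).hexdigest(),
    )


def is_duplicate_both(content: bytes) -> bool:
    """Treat content as a duplicate only if BOTH digests match a stored pair."""
    signature = combined_signature(content)
    if signature in seen_signatures:
        return True
    seen_signatures.add(signature)
    return False
```

Comparing the pair as a single tuple means content is flagged only when both digests match the same earlier item, not when the MD5 matches one item and the SHA-1 matches another.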

+1 A:

Why not simply compare the content byte for byte if there is a hash collision? Hash collisions are very rare, so you're only going to have to do a byte-for-byte check very rarely. That way duplicates will only be detected if the items are actually duplicated.

Martin 2009-07-13 20:20:10
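
A rough sketch of this hash-then-compare idea follows. Note that it assumes the content itself is still available for comparison, which is the opposite of what the question wants; `stored_by_digest` and `add_if_new` are hypothetical names.

```python
import hashlib
from collections import defaultdict

# Hypothetical store: digest -> list of contents that produced that digest.
# More than one entry per digest only occurs on a genuine hash collision.
stored_by_digest = defaultdict(list)


def add_if_new(content: bytes) -> bool:
    """Store content unless an identical byte sequence is already stored.

    The digest narrows the search; the byte-for-byte comparison makes the
    final decision, so a hash collision never causes a false duplicate.
    Returns True if the content was stored, False if it was a duplicate.
    """
    digest = hashlib.md5(content).digest()
    candidates = stored_by_digest[digest]
    if any(existing == content for existing in candidates):
        return False
    candidates.append(content)
    return True
```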

+1 A:

Combining algorithms only serves to obfuscate; it does not increase the security of a hashing algorithm.

MD5 is too broken to use anyway, IMHO. Researchers have demonstrated forging content that generates an MD5 collision, thereby opening the door to generating a forged CSR to buy a cert from RapidSSL for a domain name they don't own. Security Now! episode 179 explains the process.

For me, SHA-based hashes are stronger and most development platforms support them, so the choice is easy. The remaining deciding factor is then the block size.

spoulson 2009-07-13 20:35:36

I would only add that SHA-1 has shown signs of weakness, so SHA-2 is generally recommended.

Steven Sudit 2009-07-13 21:21:51
