views:

500

answers:

6

I have a website where users can upload their files; these are stored on the server and their metadata recorded in a database. I'm implementing some simple integrity checks, i.e. "is the content of this file now byte-for-byte identical as when it was uploaded?"

An example: for content of userfile.jpg, MD5 hash is 39f9031a154dc7ba105eb4f76f1a0fd4 and SHA-1 hash is 878d8d667721e356bf6646bd2ec21fff50cdd4a9. If this file's content changes, but has the same MD5 hash before and after, is it probable that the SHA-1 hash will also stay the same? (With hashing, sometimes you can get a hash collision - could this happen with two different hashing algorithms at once?)

Or is computing two different hashes for a file pointless (and I should try some other mechanism for verifying integrity)?


Edit: I'm not really worried about accidental corruption, but I'm supposed to prevent users changing the file unnoticed (birthday attack and friends).

I'll probably go with one hash, SHA-512 - the checks don't happen that often to be a performance bottleneck and anyway, "As Bruce Schneier says, there's enough fast, insecure systems out there already. –@MichaelGG in the comments".

+4  A: 

Checking the MD5 hash by itself is sufficient for most purposes. Although if you must, there is no harm in checking the SHA1 in addition. Keep in mind the possibility of catching something you would miss with just the MD5 check is extremely remote.

Note that in terms of scalability, the additional check adds unnecessary load on your server.

Yuval A
+7  A: 

MD5 is probably safe for what you're doing, but there's no reason to continue to use a hash with known flaws. In fact, there's no reason you shouldn't be usign SHA256 or SHA512, unless you have some known major performance bottleneck.

Edit: To clarify, there's no reason to use two algorithms; just use one that fits what you need. If you're worried about people doing MD5 collisions on you (as in, is this a security threat?), then use an algorithm that isn't as weak, such as SHA256.

Edit 2: To address an apparently still common misunderstanding: Finding a random collision on a hash is not a 1/2^n probability. It's closer to 1/2^(n/2). So a 128-bit hash can probably be collided with 2^64 attempts. See birthday attack for details.

MichaelGG
also, using two hashes doesn't add the number of bits getting exponentially harder. cracking two hashes at the same time is just as easy (or hard) as cracking each separately. ditch MD5, go for SHA-x alone.
Javier
"Finding a random collision on a hash is not a 1/2^n probability" True if the goal is to come up with a pair of collided documents. False if the goal is to come up with a document that collides with a given document.
Jason S
@Jason S: This doesn't rule out scenario 1) generate two collided documents, 2) write one 3) overwrite with the other 4) profit. (assuming a vulnerable function, e.g. MD5)
Piskvor
let me rephrase: "given" means out of your control.
Jason S
+1  A: 

In general, if the MD5 hashes don't match, the SHA1 (or any other similar hash) won't match either. I'm not going to say there aren't possible cases where it couldn't happen (because we all know there are collisions in both algorithms), but I would say it will probably never happen in your situation.

My thoughts are that providing one hash is probably sufficient enough; more than one hash becomes arduous to verify (having to verify one is bad enough, depending upon available utilities for the platform), and I seriously doubt you're going to see such amazing corruption of a file as to lead to a perfect collision.

Note: Ignore the stuff about verification being a pain; upon re-reading the question, I revised this - I took the original meaning to be hash verification for users downloading the file. If, of course, that is what was meant, then what I said still applies, I think.

Rob
Not what I had in mind originally, but good point: should the files be downloadable, I can provide checksums for the users as they are pre-computed. Nice idea!
Piskvor
+1  A: 

As a rough estimate, chance for a MD5 false positive is 1/(2^128), chance for a SHA-1 false positive is 1/(2^160), so the chance for a false positive for both algorithms is between 1/(2^128) and 1/(2^288), but you can be pretty sure that it is near 1/(2^288) as both algorithms have been thoroughly tested statistically.

At least, when using two different hashes, you are protected very well against intentional attacks at one of the algorithms.

EDIT: After some research, I stumbled upon this Wikipedia Note that MD5 birthday attacks can be done in under 1 minute, so it seems better to use a different algorithm as MD5 together with SHA-1 here. Birthday attacks for SHA-1 take 2^69 operations at the moment.

schnaader
Except MD5 has known problems and has already led to real-world collision attacks.
MichaelGG
Additionally, a random collision is not 1/2^n, it's roughly 1/2^(n/2). See birthday paradox.
MichaelGG
+1  A: 

Because the two hashes are calculated differently, two files with the same MD5 hash are no more likely to have the same SHA-1 hash than two random files. If your chance of random collision with either hash is (ballpark) 2^128, your chance of random collision in both will be 2^256.

In effect, you go from extremely low to extremely, extremely low.

It's the equivilent of going from 128-bit to 256-bit encryption in order to avoid having someone randomly guess your 128-bit key.

Jekke
+2  A: 

For file integrity (e.g. accidental/random corruption), one hash should suffice. 128 bits = 2-128 probability of an undetected error, which is for all practical purposes small enough.

For file cryptographic integrity (e.g. assurance that someone hasn't maliciously substituted an alternate file), I think you're talking about a belt-and-suspenders approach.

MD5 is considered "weak" in the sense that it is possible to construct two documents with the same hash with a much lower amount of CPU time needed than it would take for a brute-force search ("collision resistance" of MD5 has been broken).

But it's not (as far as I know) "weak" from the standpoint of, if you have an arbitrary document X, someone else can create a document Y with the same hash with a much easier time than a brute-force search (MD5 still has "preimage resistance"). (The distinction is like the difference between going to a party and finding two people with the same birthday, vs. finding another person with the same birthday as yours.)

Even if MD5 is broken in that regard, it's improbable that someone can come up with an algorithm to create documents to match an arbitrary MD5 hash and an arbritrary SHA1 hash.

This sounds kind of like the tension between the two maxims "don't put all your eggs in one basket" vs. "put all your eggs in one basket, and watch the basket". Or like spending money on two deadbolt locks vs. one deadbolt lock which is twice as good and costs twice as much. Ideally it would be best to spend CPU time calculating one secure 256-bit hash instead of two less secure 128-bit hashes using different algorithms. (yes I know SHA1 is 160bit, this is just an illustration) You're more likely to get better performance this way for a desired level of security -- that is, if the 256-bit hash isn't broken. If it is broken, you may be better off with the two-algorithm approach just to hedge your bets.

But again if this is just integrity to protect against errors, one MD5 hash is fine.

edit: to cite some useful sources: 1 2 3, "MD5 considered harmful today", RFC4270, NIST's latest update on the SHA-3 competition, and "The SHA-3 Zoo".

Jason S
Exactly. MD5 is safe for some things, but without details, there's no reason to use it.
MichaelGG