views: 239
answers: 7

Is MD5 hashing a file still considered a good enough method to uniquely identify it, given all the ways the MD5 algorithm has been broken and its other security issues? Security is not my primary concern here, but uniquely identifying each file is.

Any thoughts? Thanks a lot in advance :)

+3  A: 

It depends on what you are trying to achieve. It is extremely unlikely that any two non-identical files will have the same MD5 hash, and keep in mind that two files with the same content will always have the same MD5 hash. In fact, an MD5 hash is commonly used to verify the integrity of files, since almost any change to a file will cause its MD5 hash to change as well.

Daniel Vassallo
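
As a minimal sketch of that integrity-check use (illustrative only, not code from the answer), in Python:

```python
import hashlib

def file_md5(path: str) -> str:
    """Return the hex MD5 digest of a file, reading it in 64 KiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Byte-identical files always yield the same digest; almost any change
# to the content yields a different one.
```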
Then again, the same applies to any other hashing algorithm.
Xeross
A: 

Yes, for sure: MD5 produces a unique hash for every different input you provide, and it is also widely used to compute the hash of a file in order to detect the injection of malware or virus code into it.

SHA1 might be a good option for this purpose also.

Ummar
MD5 doesn't guarantee a unique hash. There is a non-zero probability that two different files will have the same hash. The question is whether MD5 is still safe enough for non-security use. As for protecting against malware injection, it is now useless for that purpose because its security properties have been thoroughly demolished. AFAIK, a hacker on a laptop can generate a virus-infected version of a program with the same MD5 sum as the original in a few minutes.
Marcelo Cantos
The only hashing algorithm off the top of my head that would produce a unique hash for every input is one that produces infinitely long hashes.
BoltClock
No, this answer is absolutely wrong. No hash function that maps infinite inputs to a fixed number of possible outputs can *guarantee* uniqueness. By definition.
Andrew Medico
@BoltClock - Perhaps you mean no hash at all, i.e. the 'hash' is the original file, which isn't really a hash then, is it?
Matt H
+9  A: 

Yes. MD5 has been completely broken from a security perspective, but the probability of an accidental collision is still vanishingly small. Just be sure that the files aren't being created by someone you don't trust and who might have malicious intent.

Marcelo Cantos
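
To put "vanishingly small" in rough numbers (a back-of-the-envelope sketch, not a figure from the answer): the birthday bound for a b-bit hash gives an accidental collision probability of roughly n²/2^(b+1) for n files.

```python
# Birthday-bound approximation: P(collision) ~ n^2 / 2^(b+1) for a b-bit hash.
def collision_probability(n_files: int, bits: int = 128) -> float:
    return n_files ** 2 / 2 ** (bits + 1)

print(collision_probability(10 ** 12))  # ~1.5e-15 even for a trillion files
```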
Could you elaborate on what is broken from a security perspective? What can no longer be achieved with MD5 hashing, and why is it not achievable?
none
@none: For your first question, see [here](http://en.wikipedia.org/wiki/MD5#Security). I'm afraid I don't understand the other questions.
Marcelo Cantos
@0xA3: Neither you nor I have any idea what files the OP is referring to, or how much damage a compromise would cause. It could be their kid's baby photo collection for all we know. My goal is to provide the facts; what someone else does with them is their business. Also consider that Bruce Schneier [recommends](http://www.schneier.com/blog/archives/2005/06/write_down_your.html) writing down your password; not everything needs to be stored at Fort Knox. Some things will keep just fine under the flower pot.
Marcelo Cantos
Add to all that the fact that MD5 hashes work much better than, say, SHA1 hashes as database keys since they fit neatly into a UUID column.
Marcelo Cantos
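
A quick sketch of that point (illustrative, using Python rather than SQL): an MD5 digest is exactly 16 bytes, so it maps directly onto a UUID value.

```python
import hashlib
import uuid

digest = hashlib.md5(b"example file contents").digest()  # exactly 16 bytes
key = uuid.UUID(bytes=digest)  # drops straight into a UUID/UNIQUEIDENTIFIER column
print(key)
```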
@Marcelo Cantos, I think what is lacking here is a differentiation or unpacking of the term 'security'. People obviously assume 'security' is the concern in any use of checksums, but what Marcelo likely means is 'in a laboratory setting'.
hpavc
@hpavc: No, I don't mean in a laboratory. MD5 has been broken badly enough that a hacker can use a notebook computer to generate pairs of documents with matching hashes (a collision) in under a minute. They could use this to, e.g., present an honest program for audit purposes, and slip in a trojan once the audit is complete. Any system relying on MD5 hashes to prevent tampering would not notice this. I don't know if it is computationally feasible, yet, to forge a preexisting document (a 2nd pre-image attack). But no one with a shred of sense is going to bet the bank on that.
Marcelo Cantos
@hpavc: Also, by 'security', I don't mean *any* checksum work. A compiler cache might use MD5 (or even MD4!) to check whether input files match those of a previous build.
Marcelo Cantos
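
A hypothetical sketch of that cache pattern (the names and structure here are illustrative, not from the comment): rebuild only when an input file's digest has changed.

```python
import hashlib

seen: dict[str, str] = {}  # path -> digest recorded on the previous build

def needs_rebuild(path: str) -> bool:
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()  # fine for typical source files
    if seen.get(path) == digest:
        return False  # input unchanged since the last build; reuse the cached output
    seen[path] = digest
    return True
```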
@Marcelo Cantos: You can use `CHAR(20)` for SHA-1 if you want to.
Gumbo
@Gumbo: You could, but that's 2.5 times the size of a UUID column.
Marcelo Cantos
@Marcelo Cantos: Sorry, I meant `CHAR(20)` for 160 bit.
Gumbo
@Gumbo: I wouldn't be game to use CHAR(20), since that would entail storing binary data in a column intended for text. BINARY(20) would suffice, but that comes back in C# as a `byte[]`, which is awkward to work with (e.g., no comparison operators). Similar problems probably occur in other languages. A `UNIQUEIDENTIFIER`, OTOH, maps onto a `Guid`, which is easily compared and is a value type, so less GC pressure.
Marcelo Cantos
A: 

MD5 has been broken; you could use SHA-1 instead (it is implemented in most languages).

Guillaume Lebourgeois
+1  A: 

Personally, I think people use raw checksums (pick your method) of other objects as unique identifiers far too often, when what they really want is simply a unique identifier. Fingerprinting an object for this use wasn't the intent, and it is likely to require more thought than using a UUID or a similar integrity mechanism.

hpavc
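
A one-line sketch of the alternative being suggested here: if all you need is a unique identifier, mint one directly rather than deriving it from the content.

```python
import uuid

file_id = uuid.uuid4()  # random 128-bit identifier, independent of any file's contents
print(file_id)
```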
+6  A: 

For practical purposes, the hash created might be suitably random, but theoretically there is always a probability of a collision, due to the Pigeonhole principle. Having different hashes certainly means that the files are different, but getting the same hash doesn't necessarily mean that the files are identical.

Using a hash function for that purpose - no matter whether security is a concern or not - should therefore only ever be the first step of a check, especially if the hash algorithm is known to produce collisions easily. To reliably find out whether two files with the same hash are different, you would have to compare them byte by byte.

stapeluberlauf
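
A sketch of that two-step check (an illustration of the answer's point, not code from it): compare digests first, and fall back to a byte-by-byte comparison only when they match.

```python
import filecmp
import hashlib

def same_hash(a: str, b: str) -> bool:
    """Step 1: compare MD5 digests; different digests mean definitely different files."""
    def digest(path: str) -> bytes:
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.digest()
    return digest(a) == digest(b)

def files_identical(a: str, b: str) -> bool:
    """Step 2: byte-by-byte comparison (shallow=False compares contents, not just stats)."""
    return same_hash(a, b) and filecmp.cmp(a, b, shallow=False)
```

Note that when both files are already local, the byte-by-byte step alone is cheaper, as a later comment points out; hashing first pays off when digests are precomputed or one copy is remote.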
But a hashing algorithm creates the hash by going through all the bytes, right? So don't you think the same purpose is achieved? :)
Ranhiru Cooray
@Ranhiru. No. The hash gives you a 'summary' value which (for MD5) is only 16 bytes long. To *guarantee* the files are identical you would need to make a byte by byte check. This is true no matter what hash algorithm you choose, there is always the possibility of a collision.
PaulG
But MD5 can be used, and is used, to uniquely identify files, isn't it? Is it safe to use MD5? Will I eventually generate collisions?
Ranhiru Cooray
@Ranhiru. Reread this answer, it's imho the most comprehensive one here. Hashing could be used as a first step, which gets you to 99.99^e% certainty that the files are identical, but if you want to be *absolutely 100%* certain, then you'll need to make a byte-by-byte check. This is true whether you use MD5, SHA or any other algorithm.
PaulG
This answer is wrong. Prevention of tampering and verifying uniqueness are the same thing. Also, while hashing doesn't guarantee uniqueness, neither does actual comparison. In fact, the likelihood of a hash accidentally colliding is actually lower than the probability of the comparison failing due to glitches in the CPU generated by normal solar gamma ray emissions. And don't forget that often the only source of the file is sitting on the other side of the world inside a web server, and the only independent piece of information you have for comparison purposes is the hash.
Marcelo Cantos
Agree with Marcelo: telling that version A is (probably) the same as version B is *exactly* the same problem as telling that file X is (probably) the same file as file Y. A hash designed to be good (but not perfect) for one is a hash designed for the other.
Edmund
@Marcelo. It doesn't follow logically that an accidental collision is *less* likely than accidental bit flips (whilst making a byte-by-byte comparison). You still have the same chance of bit flips when building the hash (and arguably more, since more processing time is involved). @Thomas raised the point originally to suggest that there is no guaranteed way of identifying uniqueness, though the impact of bit flips is highly debatable. The most pessimistic estimate is 1 flip per GB/hour, and ECC RAM would remove even that.
PaulG
@PaulG: That's not the point. The probability of an accidental collision at the mathematical level is much lower than an error due to a random bit flip (and ECC can't prevent bit-flips in the bus circuitry or the CPU core, btw). Thus, a byte-for-byte comparison would have almost no impact on the chances of getting it right. Besides, the answer is wrong even in principle, since the main purpose of a hash is to confirm the identity of a file when there is no trusted copy to check against, or it is intractable to do so (e.g., comparing a 100 GB file with a copy on the other side of the world).
Marcelo Cantos
This last point is worth reiterating. If you already had two full *alleged* copies of the file, A and A', then you might as well do a byte-for-byte comparison because it is at least as good as hashing and comparing but much faster.
GregS
+2  A: 

MD5 will be good enough if you have no adversary. However, someone can (purposely) create two distinct files which hash to the same value (that's called a collision), and this may or may not be a problem, depending on your exact situation.

Since knowing whether known MD5 weaknesses apply to a given context is a subtle matter, it is recommended not to use MD5. Using a collision-resistant hash function (SHA-256 or SHA-512) is the safe answer. Also, using MD5 is bad public relations (if you use MD5, be prepared to have to justify yourselves; whereas nobody will question your using SHA-256).

Thomas Pornin
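
For completeness, a minimal sketch of the recommended swap (illustrative; Python's `hashlib` exposes SHA-256 with the same interface as MD5):

```python
import hashlib

def file_sha256(path: str) -> str:
    """Chunked hashing as in the MD5 sketches above, with a collision-resistant hash."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```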
This answer might be a bit misleading if the reader isn't too familiar with hashing. There is nothing magical about SHA that *prevents* hash collisions, they are just more resistant to hash collision *attacks*. If you wanted to be more than 99.999^e% certain that files are identical, you would still need a byte by byte check.
PaulG
Actually a byte-to-byte comparison may fail due to a cosmic ray flipping a bit (e.g. transforming a `return 0;` into a `return 1;`). This is highly unlikely, but the risk of a collision with SHA-256 is even smaller than that. Mathematically, you cannot be sure that two files which hash to the same value are identical, but you cannot be sure of that either by comparing the files themselves, as long as you use a computer for the comparison. What I mean is that it is meaningless to go beyond some 99.999....9% certainty, and SHA-256 already provides more than that.
Thomas Pornin
What, you don't use ECC memory? ;). Good comment, very interesting thoughts.
PaulG