views:

4064

answers:

14

I've heard that MD5 is "broken" (in the context of password encryption). But I don't understand why! I've read the theory, but can't see it happening in practice...

I have an MD5 hash 99e9446e78aac2056d3903e1adb8fbcd
And a simple bit of code to produce it

$salt="#bh35^&Res%";
$pass="***"; //number of characters is not equal to number of *
echo $hash=md5($salt.$pass);

So, my question is:

Is MD5 really that bad? And if so, what's the password behind the asterisks? (or another word which will produce the same hash)

I'm looking to get past theoretical reasoning, and come up with a simple example that we can use to demonstrate the problem(s) with MD5.

EDIT: I have changed the pass to make it invulnerable to dictionary search.
So, the hash got changed as well.

FINAL: Well, because I was forced to accept something, I accepted must funny answer.
The pass was s4mep8ss

+3  A: 

You are using salted hashes, which is good, because it makes it nearly impossible to do a simple reverse-lookup on a hash, even for common words. MD5 has a higher chance of collisions versus SHA1 though. This means that it's more likely that you'll get the same hash for 2 different strings with MD5, although it's still extremely unlikely. However, sha1() is a drop-in replacement for md5(), so why wouldn't you use it?

Harold1983-
Because SHA1 is broken as well: http://www.schneier.com/blog/archives/2005/02/sha1_broken.html
Jacco
Ok ok. Higher chance of collision. I've heard thet nearly thousand times. So, what's the password?
Col. Shrapnel
SHA-1 is not a drop-in replacement. It has 160 bits output, not 128. Have fun dropping that in without causing problems.
Joey
Or use sha256, since sha1 is also "broken"
Stephen P
@Jacoo, 2^69 operations to find a single collision? THAT'S AN AWFUL LOT OF OPERATIONS. Fine, if you insist, SHA-512. Break that >.>
iconiK
Good point Johannes - sha1 hashes will be 40 character, md5 are 32.
Harold1983-
As for the salt, of course I am using it because of rainbow tables which are equal danger for ANY hashing method
Col. Shrapnel
+30  A: 

To address your points directly:

1) Yes, many collision vulnerabilities have been documented. The bottom line is that MD5 is fundamentally broken. According to Wikipedia:

US-CERT of the U. S. Department of Homeland Security said MD5 "should be considered cryptographically broken and unsuitable for further use," and most U.S. government applications will be required to move to the SHA-2 family of hash functions after 2010.

Bruce Schneier has also weighed in with regard to one such attack on MD5 to forge SSL certificates:

I'm not losing a whole lot of sleep because of these attacks. But -- come on, people -- no one should be using MD5 anymore.

2) Its impossible to determine what is behind the asterisks. The point of "breaking" MD5 is not to determine the original password value - it is to find a collision. If I can find another value that will hash to the same MD5 string, then that is as good as finding your original password. Of course, salting makes MD5 stronger, but it is still "broken".

Justin Ethier
Okay okay, it is broken. Well 1. Find a collision please. 2. show me a scenario of using that collision, to get access to the admin area on my site
Col. Shrapnel
@Col. Shrapnel, I get the feeling you have already made up you mind.. and are just looking for an excuse not to implement something safe.
Jacco
1. Find it yourself: http://www.woodmann.com/collaborative/tools/index.php/SnD_Reverser_Tool2. It's not our job to break into your site, even if anyone here really cared enough too. If you want to rely on MD5, you can have fun trying to sleep at night knowing your security is fundamentally broken.
peachykeen
I break into your site and steal the hashed version of your passwds - the point of hashing is that those hashes are no use to me, I can't get the passwds from them. The flaw in MD5 is that I can find another word with the same MD5 hash without trying all possible combinations - I can then use that new word to log into your site. In practice it's not a problem - I still have to do a hell of a lot of tests, just not as many as you would think. And the collision I come up with may be 10k long which I can't use as a passwd anyway.
Martin Beckett
@Martin: Quite frankly, the scenario you gave isn't plausible. If the hacker has access to get the hashed version of the passwords, then they probably already have access to UPDATE one of those passwords. Considering the salt is typically stored in the same record as the password itself, it would be trivial to replace the users password (and salt value) with my own.
Chris Lively
@Justin Ethier: So how would you go about generating a collision `B` for a digest `A` such that `md5(B + salt) = A`? *This* is what Col. Shrapnel is asking. It's all very well having a collision to the MD5 digest, but this is useless from a practical point of view because it has to be salted first.
Will Vousden
@Chris Lively - that depends on the site security! I was just explaining how/why/if MD5 collisions are an issue in this case.Personally I wouldn't worry.
Martin Beckett
@Chris Lively: Unless a bug in the page or some other problem gives you a partial view of the database. It's more likely you'll get a view of the db, especially on something like a PHP database connection call or a db query. If the user sees parts of the call, some of your secure info could leak out. If a comparison with the hashed password, or even the salt, was leaked by poorly caught errors, then checking for a hash collision is very much a valid way of hacking in.
peachykeen
@peachykeen: considering a lot of dev's use either inline sql or some type of ORM that requires access to run it's own sql calls directly (in other words NOT using stored procedures), then whoever is able to do a select on a database will have access to run insert/update statements. So, checking for collision is by definition unnecessary for a tremendous number of applications.
Chris Lively
@peachykeen: further, if they can see the db query and results then they really don't need the passwords anyway.
Chris Lively
@Chris why are you being hardheaded? The fact is that relying on MD5 compromises your security. In fact it is not at all unlikely that an attacker has the ability to view part of your DB without being able to UPDATE, through SQL injection or otherwise. In fact that's probably the most common entry point to hijacking a web site short of knowing the password outright. If you're so confident, post your site URL that relies on MD5 and hope your query parametrization beats my escape schemes (I'm 80% confident it won't).
Dustin Fineout
@Dustin: You and I are saying pretty much the same thing. I'm just going a bit further. MD5 isn't the problem in my mind, ANY time you only encrypt the password and nothing else is the bigger issue. Further, giving full access at an application level to send custom sql to a database server is a huge problem. I just mentioned how to bypass a particular users security when you have update rights. The same method can be used to just get the data you want to begin with.
Chris Lively
@Dustin: I know you think that parameterization is the main issue. But it's not. What happens when they bypass your code entirely? Most security problems aren't caused by some script kiddy who brute forces a password. The real issues come in when they bypass your code and go straight to the database. Whether by installing their own php/.net/java page on your site or by simply accessing the database directly (with the password you stored in your config file). These attacks don't typically come from outside. They are inside jobs. You can't trust ANYONE.
Chris Lively
MD5 isn't encryption, it's a hashing function (it's one-way). That aside, what do you mean by "and nothing else"? I agree that giving an application free reign to issue any query is epic failure, but all I'm talking about is SELECT functionality. Parametrization isn't a problem (I never meant to imply that and no I don't think it's the main issue, it's a solution), but it must be robust to offer any protection. How would you bypass the code or access a config file? Inside attacks aren't the issue here (aren't we discussing MD5?) obviously anyone with physical access has COMPLETE access.
Dustin Fineout
+3  A: 

The weakness isn't that you can discover the value being hashed. The weakness is that if you know X and MD5(X), it's possible to construct a Y such that MD5(Y) == MD5(X). This means that it's possible to forge a message to match a signature. On top of that, it has a higher rate of collisions than other, stronger algorithms.

In general, you should use SHA-1 or better.

Read this post for more information.

Alex
You don't even have to know X. If you know MD5(X) it's possible to construct a Y such that MD5(Y) == MD5(X).
Stephen P
I am sorry, I weren't asking for the posts. I were asking for the example.
Col. Shrapnel
@Alex: -1. (And @Stephen P as well) You're stating incorrect information. MD5 is known to have collision attacks (possible to construct two messages X and Y such that MD5(Y)==MD5(X)) but not yet known to suffer from preimage attacks, where a hash and/or a message X has been given, and the attack is to construct a message Y where MD5(X) == MD5(Y).
Jason S
+19  A: 

There seems to be two ways in which MD5 can be "broken," not at all equivalent:

  1. It is easy to find two inputs X and Y such that md5(X) == md5(Y).

  2. Given md5(X), it is easy to find a Y such that md5(X) == md5(Y).

It's my understanding that md5 has vulnerability 1, but not 2. Although I see why vulnerability 1 might be undesirable for some applications (e.g. digital signatures), the original poster is using it for password authentication. Why is md5 "broken" for this application?

+1 for the only response here that distinguishes between a preimage attack (your point #2, which is actually a "first preimage attack") and a collision attack (your point #1).
Jason S
*"Why is md5 "broken" for this application?"* - because hashes do not become "less" broken, only more. It takes less than brute-force to brute-force an MD5 hash, which is clearly not what you want... And why use a broken hash when it is just as easy to use a non-broken one?
BlueRaja - Danny Pflughoeft
The scenario is known as a Birthday Attack. When I'm searching for 1 of n answers, I can find it in substantially less than 1/n fraction of the effort needed to get 1 specific answer. Given a copy of the password file (Chris Lively in another answer thinks this is infeasible, but I wonder if he has never misplaced a backup tape in his life), I may not be able to get (or fake) YOUR credentials, but it's likely I could get several sets of credentials by this means.
Jason
@Jason Isn't this a weakness of *any* hash algorithm with fixed-size output, no matter how secure? I don't understand why quickly generating md5 collisions helps this kind of attack.
Yes, that's basic Information Theory - The Pigeonhole principle. The idea with a cryptographically strong hash function is that you have enough pigeonholes to make finding a collision infeasible for the forseable future. That quality no longer applies sufficienty to MD5. Where opinions differ is to whether the attacks are feasible now, or simply feasible soon. But the general concensus, for a couple of years now, is that there is no justification for deploying NEW systems that rely on MD5, and that old systems should review their use and have a plan to migrate away.
Jason
+123  A: 
0xA3
+1 this is the only answer so far that addresses the OP's request for an example of cracking MD5.
Bill Karwin
I looked into that method. From what I could tell it requires the original file/string to hash before generating a collision. That's probably intentional as they didn't want to release a tool to the wild for script kiddies.
AlReece45
Nice. Interesting read. Let's see, when SHA-512 falls :)
Techpriester
Note to OxA3 : +1 for great example. This is exactly what the poster wanted. Note to OP (Col Shrapnel): Other answers are equally valid, they just didn't give you the answer in the format you wanted. This site (because it's well-designed) does not force users to give boolean responses. If so, no one would ever get the BEST answer to their questions. If you have a problem with that, use another site that functions more like a poll.
NateDSaint
-1, you're not distinguishing between preimage attacks and collision attacks. The Peter Selinger demo demonstrates a collision attack, not a preimage attack.
Jason S
@Jason: Col. Shrapnel does not seem to understand that there are different ways for a hash to be broken. No one has claimed that there is a good (first) preimage attack, only that MD5 is broken, which this page very clearly shows that it is.
BlueRaja - Danny Pflughoeft
@BlueRaja: Col. Shrapnel's question pertains to password security. Most people seem to have completely overlooked the issue of the salt.
Will Vousden
use Digest::MD5 qw(md5_hex);my $salt='#bh35^my @var = ();my @num = (1 .. 255);push @var, chr($_) foreach @num;foreach my $a (@var) { foreach my $b (@var) { foreach my $c (@var) { my $md = md5_hex($salt.$a.$b.$c); print "Found: pass ($a $b $c) md5($md)" if $md eq 'c1e877411f5cb44d10ece283a37e1668'; } }}
Konerak
@Jason S: At the time I wrote my answer the question basically was: "Is MD5 really that bad" which Peter Selinger nicely demonstrates. The second part of the question (to guess the password from the hash) had already been addressed by Justin Ethier at that moment.
0xA3
@0xA3: your edit fixes my objection. Thanks.
Jason S
+1 for the great example that illustrates the point nicely for non-cryptos like me. I'd still like to see the Col's password revealed, although that would be mostly for entertainment purposes.
Chris Thornton
what's up with the font? it's too small to read!
Phil
Thanks for this - it way enlightening!
nikic
+2  A: 

You probably don't need to worry about it. The problem with MD5 is that somebody may be able to find another message that hashes to the same value, which only matters in cases where the hash is visible. This is a big problem with message authentication where you publish a message and say that it is valid if it has a specific MD5 hash. If I can find another message that has the same hash, I can pass it off as valid.

In your case, however, it looks like you're hashing passwords. If you keep the hash value private, there's no way to create a collision, so you'll be safe.

Gabe
..as long as there's no external access to the password datastore. Say, by SQL injections and so on.
BalusC
*Unless* someone gains access to the database where all of your hashed passwords are stored, but you'll likely have bigger problems at that point.
Justin Johnson
@Gabe: Not true. MD5 is not yet subject to preimage attacks.
Jason S
Assuming that the password hashes are stored in the same database as the data they're protecting, there's *very* little value in somebody bothering to brute force hash collisions.
Gabe
@Justin Johnson *you* might have bigger problems, but your *users'* biggest problem will be the fact that you leaked their password. You could at least *try* to protect them adequately by using good hashing practices.
hobbs
@hobbs: How does leaking a salted MD5 hash of a password leak a user's password?
Gabe
@Gabe the part where a person with even the most modest resources can brute-force billions of MD5s per second.
hobbs
hobbs: How do you know which of the infinite messages that hash to the same MD5 as my password is actually my password?
Gabe
@Gabe It's overwhelmingly likely to be the shortest one. Even if you have what you think is a long password, it's very unlikely that even one of its MD5 preimages is shorter -- especially among the set of characters that can be *typed*. And that's even more true of the average user, whose password *isn't* long, *doesn't* come from a wide charset, and isn't creative. Or, put practically -- I've *used* password-cracking tools, and I've never had one return me a string that *wasn't* the correct password. The easiest guess is "always" correct.
hobbs
@hobbs: Given that there's no feasible preimage attack on MD5 yet, you're better off with a dictionary attack on the web site -- you don't even *need* access to the password database! If your concern is that MD5 is just too fast, Col. Shrapnel can make it take 1000x longer just by hashing 1000 times in a loop, so why bother with a whole other algorithm?
Gabe
+7  A: 

md5() is broken due to possibility of a malicious person to intentionally generate a collision. This is more of a problem for Certificate Authorities and the Public Key Infrastructure (PKI) than passwords. By generating a hash collision its possible to forge a SSL certificate.

Further more a collision against md5() is generated by using a specially crafted prefix, if you append a salt to the beginning of the message then an attacker cannot control the prefix and a collision cannot be generated.

However my biggest argument is that it is trivial to use a secure message digest function. You can install mhash or use one of the many sha2 php implementations.

Rook
+1 for it is trivial to use a secure algorithm.
Jacco
I haven't been here long, but this seems to be a theme around here: People are violently determined to expend extra energy to accomplish something with (provably) worse results than the accepted practice.
Jason
@Jason I haven't been here long either and I find that people like to chit-chat about security more than actually proving weather or not a system can hold water. Have you ever written an exploit?
Rook
Your question has nothing to do with the current discussion. Did you think I was baiting you? Or are you baiting me... but the answer is yes.
Jason
@Jason well that makes 2 of us(http://www.milw0rm.com/author/677), because its safe to say no one else on SO has.
Rook
I don't have a specific problem with hobbyists (I was one, and I'm sure you were once too), except for two things. It's challenging to remember to answer as precisely as you can when surrounded by so many off-the-cuff remarks, and it's a bigger problem when the situation is mistaken for a democracy rather than a meritocracy (or an autocracy). I expect I'll figure out which one we have here soon enough. Like my day job.
Jason
+6  A: 

Finding a collision means I find two different inputs that result in the same hash.

Finding a preimage means finding an input that results in a single, specified hash. I.e., you hash something, and I know only the hash, and my challenge is to find a second input that results in the same hash as output.

The currently known attacks on MD5 allow generation of collisions, but do not allow the generation of a second preimage.

On the other hand, the fact that MD5 has been broken to the extent necessary to find collisions easily1 gives a fairly strong indication that finding a second preimage is much more likely than with another hash that hasn't been broken to the same degree. It's possible that nobody will ever develop a preimage attack against MD5 -- but then again, it's also possible that somebody already has, and is putting it to use for their own nefarious ends rather than publishing it.

  1. There's a huge difference between the breaks of MD5 and SHA-1: the break against MD5 makes it trivial for almost anybody to create colliding inputs. If you have an old (e.g. Pentium IV) computer lying around mostly unused, it's sufficient to find colliding inputs at a rate of something like one pair per hour or so (and the "or so" mostly translates to "or very likely faster"). The break against SHA-1 is right at the border between theoretical and practical. As close as I can figure it, you'd need to spend something like a million (US) dollars to build a machine that could find one collision per week.
Jerry Coffin
Col. S: comments like that get this question closed
Jason S
@Jason it doesn't really matter as there are no *answers* anyway.
Col. Shrapnel
+70  A: 

Your password is hunter2

Kevin Panko
+ 1 for awesome geek reference
Rubys
Does he keep `1 2 3 4 5` for his luggage?
Donal Fellows
@Donal: amazing! That's the same combination as on my air shield!
Graham Lee
So the combination is one, two, three, four, five. That's the stupidest combination I've ever heard in my life. That's the kinda thing an idiot would have on his air shield.
Christina Mayers
Did OP really accept this as best answer??? Yikes. There were some really good answers below.
webbiedave
@web I was just thinking that. This was amusing, but there were some excellent answers. It got [0xA3](http://stackoverflow.com/users/40347/0xa3) a Populist badge, anyway
Michael Mrozek
so, let me see, you earned 550 rep points by being funny? :))
Andrei Rinea
+6  A: 

What you have done here is epitomized a train of broken thinking into a question that seemingly challenges a logical fallacy.

In fact, your question is as much a statement as it is a fallacy.

If a one way hash collides, even once, don't use it, period, for the purposes that you are describing.

I really, really hope that you don't work for my bank.

Tim Post
Well I didn't think of it from such a point of view. I apologize for that. Let me explain what do I mean. There are many *statements* in our life like "wash your hands before eat". I don't want to *believe* in these statements. I want to understand why. Yes, I've lack of intellect to understand theoretical rezoning. But I suspect that any theory can be proved with simple experiment. And I have never heard of a real example of breaking of salted MD5 hash. While everyone keep talking of it's weakness. I want to distinguish the rumors from the truth. In practical way.
Col. Shrapnel
I think the fallacy boils down basically to this: The strength of a cryptographic hash is based on some mathematical properties of the algorithm. Why wouldn't one use math to *disprove* the strength of that same algorithm? In fact don't exploits against these systems often come as concrete implementations of the math? (ignoring for a moment the class of errors caused by bugs in the original implementation)
Jason
I remember as several years ago I wanted to deliver content in my web-application based on a REQUEST_URI hash, and decided to use md5 as the hash function. I happened to run into a collision on about 300 uris in my development environment and do not believe in md5 since that day =)
newtover
@newtover, you're one lucky guy...@Tim Post: You mean a hash that has a collision published, right? All fixed-length-output hashes have an infinite number of collisions, it's just a matter of being able to find them.
Longpoke
@Longpoke - Published or 'found' by a program breaking. Part of my point is, md5 had uses far beyond hashing passwords.
Tim Post
newtover: can you possibly publish the colliding URLs? To my knowledge no non-contrived MD5 collisions have been published before.
Joshua
+4  A: 

Assuming a password of a maximum of 20 characters and an alphabet of Latin characters and numbers, there are as many as 5e+35 possible strings.

I wrote a very simple program that attempts to generate every possible string and then compares the salted hash against the one you presented.

It will take roughly 16,1e+22 (16+22 zeros) years to calculate all hashes with this program. It's a lot of time, of course, so yeah, it seems to you that your hash is unbreakable. However, now, remember that there are collisions in MD5 hashes, right? So for every two strings that generate the same hashs, you can reduce the total number of calculations in one. If it collides a lot -- then maybe we can find your pass in a feasible time... one week, maybe?

I will try running the application in my home desktop, which has more processing power and maybe multi-thread it, even though I surely won't left it running for a week.

UPDATE: Recalculated the total of years after improving the algorithm: 15e13 years, way below the last estimative. That's with a single thread (actually, two threads, one to do the calculations and another to give status reports every 30 sec...). Tomorrow I might try creating a thread model for this app.

Bruno Brant
+1 for effort, but given the context of a password file, a better argument would be made if the OP provided you with a list of 10k hashes and said, "give me a collision for one". But in either case, you're just doing brute force cracking here. And no digest, SHA2 or otherwise, is going to resist that for long unless you increase the complexity of the passwords + the salt to create enough possibilities that your attacker can't just try them all. Weakening a hash is the process of reducing the search space far below the total number of combinations. And that's what's happening.
Jason
Read the article linked above: http://chargen.matasano.com/chargen/2007/9/7/enough-with-the-rainbow-tables-what-you-need-to-know-about-s.html and especially some of the links out of that article. I think you'll realize that you can shave many orders of magnitude off of your current implementation. Threading will probably only ever shave one or two orders off of the time, but might illustrate how you'd perform the same operation on a bot net. *However I will reiterate that this is still brute-forcing it, and the attacks are about getting below O(2^N) runtime.*
Jason
@Jason: thanks... I will read the article.
Bruno Brant
@Jason: UP! Very f$%@$ cool article. :)
Bruno Brant
+4  A: 

Using a single iteration of a simple hash, even with salt, is entirely inadequate for password hashing, as they're entirely too easy to brute-force even with a hash that isn't broken. Use a proper hash-strengthening scheme or key derivation function such as bcrypt, scrypt, PBKDF2, or Glibc's iterated SHA-2 family hashes.

hobbs
Please don't take it as a trolling but I really don't understand why everyone says it's `entirely too easy`, but noone to show an example? okay, 3 alpha characters in lower case is easy. But what about 8 alphanumeric mixed case?
Col. Shrapnel
That's only a few hundred billion hashes. A modern video card can compute about a hundred million hashes per second. So you're talking a few *hours* for an exhaustive search of alnum^8.
hobbs
Whoops, that's not a modern video card anymore. A 3-year-old video card can do a hundred million hashes per second. A newer one will do a half-billion hashes per second.
hobbs
I had a time to think it over. Does this key derivation help against bruteforce? If so, it would be great, though I don't understand how it can be
Col. Shrapnel
Of course, that's the point. A good KDF is hard to parallelize in hardware, and makes computing the hash arbitrarily difficult -- meaning you can slow down a brute-force attack by a factor of 1,000 or 1,000,000 or whatever factor you like compared to a naïve hash.
hobbs
@hobbs, my 9800GTX can do 350M/sec MD5s, and this card is years old. Current cards can do around 2B/sec afaik.
Longpoke
A: 

I'll come at this a different way. A lot of companies use MD5 hashes to "sign" their files, or assert that a file with a given hash is unique from another file. They base their entire systems on this, especially with respect to file deduplication, or single instance storage.

Now given the fact that you can have different files with the same hash (see these examples), what possible faith can you put into a system that asserts that no two files exist with the same MD5 hash?

Edit to answer comments:

Let's assume a few things, to take it out of the context of the mathmatical realm, and place it into the context of the original question "Why is the use of MD5 bad in practice?"

Say your company is involved in litigation, and the opposing party demands any and all documents relating to "X". You go and buy some software that will crawl all your storage locations and caltalogs the billions of files and emails and attachments, generating an MD5 hash for each. You then exclude all "duplicate" files based on the MD5 hash, and produce the rest of the relevant documents to opposing consul.

Now say the opposing counsel is a bit of an "enthusiastic" litigator, and wants to cast doubt that your company actually met its obligations, specifically in the trustworthiness of using MD5 as a deduplication mechinsim. The opposing pary is going for your company's throat, wanting the judge to impose some hefty sanctions, or even a summary judgement.

So if you were to go in front of a court in a litigation setting, where your company was under penalty of such sanctions, your defense woud be, yes, using MD5 is fine, because:

You need to distinguish among the cases that

  • (a) hash collisions can happen (albeit with extremely small probability),
  • (b) two files can be intentionally constructed to cause a collision (this is a "collision attack", it's possible with MD5),
  • (c) an arbitrary file can be intentionally constructed to cause a collision with another file (preimage attack, not known for MD5)...
  • (d) that a hash collision w/o intentional construction of files is likely to happen (which is not true... you'd need approx 2^64 different files to have a likely collision in a 128-bit hash.)

To which the litigator would likely respond:

  • (a1) Is it possible that two different files can have the same MD5 hash?

    (your answer would have to be yes)

  • (b1) Do you know if there are any examples of two different files that have the same MD5 hash?

    (again, your answer would have to be yes)

At this point, you have lost support in the eyes of most judges. It is now up to your legal team to steer the course back onto the "MD5 is fine" track. I'd rather not be in that position in the first place. At least with SHA-256 or other longer hashes, you can answer "No" to (b1). And thus, the whole point to the question: "Why is the use of MD5 bad in practice?"

GalacticJello
You can always have different files with the same hash: any practical hash is a function with domain space much larger than range space, so there will always be many inputs mapping into the same output. You need to distinguish among the cases that (a) hash collisions *can* happen (albeit with extremely small probability), (b) two files can be intentionally constructed to cause a collision (this is a "collision attack", it's possible with MD5), (c) an arbitrary file can be intentionally constructed to cause a collision with another file (preimage attack, not known for MD5)...
Jason S
...and (d) that a hash collision w/o intentional construction of files is likely to happen (which is not true... you'd need approx 2^64 different files to have a likely collision in a 128-bit hash.) For further reference see RFC4270 http://tools.ietf.org/html/rfc4270
Jason S
Depending on the document format used I'd probably be able to answer the judge that no meaningful document could be produced that way (MD5 collision generator likes to generate files with bad bytes).
Joshua
And then the opposing conusel would retort with these two files: http://th.informatik.uni-mannheim.de/people/lucks/HashCollisions/letter_of_rec.ps http://th.informatik.uni-mannheim.de/people/lucks/HashCollisions/order.ps (use this to view: http://view.samurajdata.se/)
GalacticJello
The point is, you don't need to put yourself in that position in the first place, as the original question is asking "Why is it a bad practice?" If you are designing a new system, don't add that risk. It's following a "best practice" that experts have been saying for years: don't use MD5.
GalacticJello
:shrug: -- anyone using hash algorithms should know what their strengths and weaknesses are. nobody should use them for applications where collision attacks put the application at risk. If collision attacks are not a risk (e.g. website password + salt hashing where preimage attacks are the only risk -- and that only if the website database is compromised) it's fine to use MD5 but people need to be aware enough to make that determination on their own.
Jason S
using a hash like md5 for deduplication is fine aslong as you if you have a collision, check the files byte by byte for their difference. this will give you a 100% chance that you don't kill false duplicates. so you check all files and do a bucket sort according to their hashes and then compare all files in the buckets with eachoter, if there are only few files, or do a second hash like SHA and resort them according to the SHA buckets and finaly just check all the files that are not alone in one bucket.
ITroubs
@iTroubs: That's my point. Why use MD5 when you know you will have collisions, and then when you hit a collision, you end up hashing the file using another method? That introduces signifigant slowdowns in processing billions of documents, especially when you have many, many legitimate duplicate files. You have to ensure they _really_ are duplicates, and jump through a ton of hoops.
GalacticJello
A: 

The first step is admitting the problem which you are doing by writing this question. That's good.

Now step slowly away from the broken algorithm and use one of the many fine alternatives. If you really want to be forward looking, use one of the new candidate hashes submitted to NIST such as Skein.

OTOH your simplest and most widely implemented bet is probably just SHA-2 at 256 bits. That's a wonderful middle ground of size and strength. I'd advise against using SHA-1 at this point as it will be the next to fall. New code should anticipate the next move.

WRT MD5, that has pretty much been answered by others. Safe to say major X.509 certificate implementations still depend on that old beast, but anyone writing new code should use something that isn't known to be broken. The harder question is how to get rid of MD5 from old code, it's sort of like the year 2000 problem except it has no end date so it ends up just being a multi-decade security flaw happening over and over again.

cyphers