views:

862

answers:

13

I was reading a question about MD5, and it made me remember something that boggles me. Very simple question, and I'm sorry if it's not a good one. I just can't understand how you convert something to one thing using some algorithm, and there being no way to convert it back using the algorithm in reverse.

So how is this possible?

Also, since multiple strings can create the same MD5 hash, due to it being less data than the input string, how would any other hashing system be any better?

+60  A: 

Basically it's because the output of MD5 contains less information than the input. This is what distinguishes a hash algorithm from an encryption algorithm.

Here's a simple example: imagine an algorithm to compute the hash of a 10-digit number. The algorithm is "return the last 2 digits." If I take the hash of 8023798734, I get 34, but if all you had is the 34, you would have no way to tell what the original number is because the hashing algorithm discarded 8 digits worth of information. It's similar with MD5, except that the hash is computed via a complex procedure instead of just chopping off part of the data.

EDIT: in response to the edit in the question, that's actually an interesting question. For one thing, the probability of a collision (finding two inputs that produce the same output) is basically 1 divided by the number of possible hash outputs. Collisions are an undesirable feature of hashes in the sense that you want to make them as unlikely as possible, so one way to get a better hash algorithm is to use a hash with a longer output. In the digits example above, taking the last 4 digits instead of the last 2 digits reduces the probability of a collision to 1 in 10000 instead of 1 in 100, so it's more likely that all the 10-digit numbers in whatever set you have will have different hash values.

There's also the issue of cryptographic security. When you want to use a hash to make sure that some data is not tampered with, it's desirable that whoever's doing the tampering can't predict what inputs will produce a given output. If they could, they would be able to alter the input data in such a way that the output (the hash) remains the same. Going back to the digits example again, let's say I'm going to email you the number 1879483129 and it is critically important that this number gets to you unaltered. I might call you up and tell you the hash of the number, which would be 29, and that way if something happened to the data in transit, you'd know. (Well not in this case because this is a useless hash algorithm, but you get the gist ;-) However, since the "last 2 digits" algorithm is not cryptographically secure, a nefarious hacker could change the number en route to, say, 5555555529 and you wouldn't know the difference.

It's recently been shown that MD5 is not cryptographically secure, or at least not as much as we might like. That means that it is possible to find different inputs which correspond to any given output. It's still a fine algorithm for protecting against random bit flips and the like, but if there's a chance someone might want to intentionally corrupt your data, you should really use something more secure, like SHA-256.

David Zaslavsky
Edited my original post, please read the new section at the bottom.
Rob
+1 - excellent simple example.
zombat
@Rob: I saw that, I was editing while you were typing your comment.
David Zaslavsky
Thanks, great example.
Rob
+1  A: 

Here's a simple answer...

There are a finite number of hash values, and an infinite number of hashable, plaintext values.

Therefore, reversing a given MD5 hash would result in an infinite number of possible plaintext values.

Dolph
Technically, it might be the case that there for some hashes may be only a finite amount of different plaintexts that hash to it, may it not?
Sebastian P.
Edited my original post, please read the new section at the bottom.
Rob
@Sebastian P: Not really. Given an (theoretically infinite) set of inputs, MD5 should generate a (theoretically infinite) set of outputs. (It is theoretically impossible, given a non-1:1 hashing function and an infinite number of inputs, that any given hash aligns to a finite number of hashes; this is an implication of infinite sets, rather than an implication of MD5, i.e. there are a fixed number of valid MD5 sums, 2 ^ 128; given an infinite set of values with an MD5 sum, there are an infinite number of collisions for each hash).
ig0774
@ig0774: I guess technically it's possible to construct a weird hash algorithm that maps only a finite set of inputs to one output. In an extreme example, `hash(x) = x<64 ? x : 63` for a 6-bit hash of non-negative integers `x`. But as far as hashing algorithms that are actually used in practice, I would be very surprised to find one that mapped a finite number of inputs to one output. (There are probably mathematical proofs about this for the major algorithms)
David Zaslavsky
@David: I guess you are correct about hash algorithms in general (and such an algorithm might meet some technical need or another). My point was about MD5 in particular.
ig0774
@ig0774: There are a fixed number of MD5 hashes, yes, and an infinite number of texts that can be hashed by the algorithm, yes. I am just curious as to why you say that each of these 2^128 hashes is guaranteed to have an infinite number of plaintexts that hash to that hash; that fact is not obvious to me. It is clear that it would be a desirable property that most likely is fulfilled in MD5, but I am curious about wherein the proof of the statement lies.
Sebastian P.
+9  A: 

Here's a parallel:

Add up the ages of everyone in your family. Only keep the last two digits.

Now tell me everyone's ages based on that one number.

Joe
+1  A: 

Think about this:

I have a numeric string, say it's "12345678".

I have a hash algorithm say it's get the sum of each single number , it's name defined to f()

so i get f("12345678") = 1 + 2+ .. + 8 = 36.

Then how can you get x = "12345678" from f(x) = 36?

I know this algorithm is bad, but this can answer you question.

Colin Niu
+43  A: 

I just can't understand how you convert something to one thing using some algorithm, and there being no way to convert it back using the algorithm in reverse.

You can turn a cow into hamburger, but you cannot turn hamburger into a cow.

The transformation reduces data that exists by destroying it, and that data cannot be recovered.

Tangurena
But the result is delicious.
Michael Petito
I like cows :( That's terrible.
Rob
+1 for the most inspired analogy I've seen. Mmm... hash functions and hamburgers.
ig0774
Of course you can turn hamburgers into cows, just as you can turn scrambled eggs into chicken. You feed hamburgers to a calf and grow it into a full cow. It's inefficient, you waste a lot of tasty hamburger in the process, but it works.
Remus Rusanu
@Remus, and then you run a non-insignificant chance of mad cow disease.@Rob, would potatoes into hashbrowns be less offensive? Myself, I like cows, too. They taste great.+1 to @tangurena for initial analogy!
Nathan Ernst
Its not offensive, I just really like cows =[ Not a vegetarian or anything, you're right, they DO taste great. But still =[
Rob
And just like hamburgers, hashes frequently require salt.
Randolpho
A: 

Depends on the operations made by the algorithm, a very simple example of lossy operations is the division between numbers.

3/17 = 0.1764705

Now try to do 0.1764705 * 17, it's equal to 2.9999985.
You can choose also to keep only the first 3 decimals and in this case it will be much more lossy.

Basically if you want to reduce by calculation 1GB of contents to 32bytes string you must use a lossy algorithm, otherwise you will obtain the compression of file.

Cesar
+2  A: 

In answer to the second part of your question (an answer to the first part having been more than adequately given by others above): MD5 is considered weak due the proofs of attacks against the cipher (i.e., changes that can be made in the plain-text that do not result in changes in the MD5 sum). Other hashing techniques may not be as easily susceptible to essentially arbitrary hash collisions (at least such arbitrary collisions have not, as yet, been shown to be possible with the SHA-2 set of hashes, etc.), and hence, an attacker is less likely to be able to replicate a hash hashed in a non-MD5 technique (theoretically, of course, hash collision attacks are possible against any hashing function; it would not succeed as a hashing function if this were not the case; the question is how easily an attacker can succeed in "faking" a "correct" plaintext, that is, one that hashes to the same hash value).

Incidentally, the MD5 sum of a plaintext is not necessarily safe because it contains "less" data or is "lossy", but because, from an arbitrary plaintext, it computes a sum-value within a fixed range (for plaintexts < 128 bits, the MD5 sum, in fact, contains more information than the plaintext...), and, therefore a number (theoretically infinite) of plaintext could all align to the same MD5 hash.

ig0774
A: 

Also, since multiple strings can create the same MD5 hash, due to it being less data than the input string, how would any other hashing system be any better?

While it is true that there must exist multiple (even infinitely many) messages that have the same hash, the goal of a cryptographic hash is to make it infeasible to find such collisions.

You might be thinking that one could just find collisions by calculating the hashes of random messages until you eventually get the same result twice. However, you'd be underestimating the size of the space of possible hash values.

For MD5, the size of the hash is 128 bits. The 128 bit space is, to paraphrase Douglas Adams, big. Really big. You just won't believe how vastly hugely mindboggingly big it is. The number of possible hashes is 2128, or 3.40282367 × 1038. That's a 34 followed by 37 zeroes! If you could count to a trillion in one second, it would still take you 10 billion millenia to count through all 128 bit numbers.

However, some hash algorithms like MD5 have weaknesses that allow attackers to reverse it (i.e. find a message with a given hash) with significantly less effort compared to just brute force attempts. Algorithms with known weaknesses are therefore considered worse. MD5 is considered completely broken in this regard.

Wim Coenen
Even Google can help: http://md5crack.com/
Pascal Thivent
A: 

Hmm, not to be rude, but it seems to me that all the answers about "less information coming out than going in" miss the point.

The main use of MD5, and similar cryptographic hash codes, is to encrypt passwords. In that case, I don't care whether it's possible to reconstruct the original string. All I care is whether I can construct any string that would hash to the same value.

Take a simplified example: Suppose our hash algorithm was "take the last two digits". So if my password is "12345678", the hash code is "78". Is there any way to go from "78" back to "12345678"? No. But if I'm hacking passwords, I don't care if I know what your original password was. I just want a password to let me in. So if I knew this was the algorithm, I'd say great, I'll use the password "99978". It hashes to "78", so the password validation algorithm will pass it, and I'm in.

Obviously MD5 is much more difficult to reverse, even in this "anything that will hash to the right value" sense, then a simplistic algorithm like "take the last two digits". But is it literally impossible? That puzzles me, too. So sure, information is discarded along the way. But couldn't I reverse to to an "any" value by filling in any random value at any point where information is discarded? I haven't looked at the actual algorithm for MD5. I presume it's not something easy to reverse, like change all the pluses to minues or something trivial like that, or somebody would have done it a long time ago. From the fact that there are millions of hackers out there who have tried to crack this, even if it is theoretically possible, it must be incredibly difficult.

Jay
The original question was about the existence of algorithms for which it's not possible to recover the original input given the output, and MD5 definitely qualifies as one. As for your idea of simply finding _any_ input that will hash to a given output, it's definitely possible if you want to build a lookup table, but that takes massive amounts of computing power. I don't think you could just run the algorithm in reverse, filling in random data as necessary, because MD5 scrambles bits at every stage. (I used to know the algorithm but I've long forgotten it)
David Zaslavsky
I agree these answerws are technically correct, I was just trying to point out that they are not really relevant to the real problem of protecting passwords. To take it to a ridiculous extreme, an algorithm that threw out ALL input data and mapped everything to the string "foobar" would make it impossible to have any clue what the original input was, but it would be pretty useless as a password-encrypting algorithm!
Jay
Anyway, as I say, I haven't looked at the MD5 algorithm. No matter how it scrambles bits, I would think that, IN PRINCIPLE, you could always fill in any discarded bits with, say, constant zeros. I presume, though, that the nature of the function is such that you can't simply reverse each step. An algorithm that said multiply by 2 and add 5 could easily be reversed by subtracting 5 and dividing by 2. But more complex manipulations might be difficult to reverse step-by-step like that. Someday I'll have to actually study the algorithm. When I'm sitting around with nothing to do.
Jay
A: 

Essentially, the bit operations involved mean that reversing it would be technically infeasible. In order to construct a set of outputs, you would require insane time complexity and huge memory complexity. It's not impossible at all - but it doesn't have to be, merely beyond the power of even our best supercomputers by a mile.

DeadMG
+1  A: 

Consider the following function: f(x) = x*x. Now, given that you know f(x)=25, what is x? Well, the answer could be 5 or the answer could be -5. You cannot recover the input to f, because there exists some value in the range of f such that more than one element of the domain of f maps to that value under f. Consequently, the function f is non-invertible. The same concept applies to MD5; there are multiple inputs to the MD5 algorithm that will, despite being different inputs, yield the same hash value as a result. In other words, the MD5 algorith, like f(x)=x*x, is not one-to-one and therefore not an invertible function.

However, this does not mean that you cannot recover the input to an MD5. It simply means you cannot recover the input to and MD5 with 100% certainty. To make this more concrete, let's look again at the function f(x)=x*x. Now what if I told you that for any given input to f the probability of it being positive is 99%? In that case, you could make a very good guess that a hash of 25 came from a value of 5, and not -5. This is, indeed, how people are able to break hash functions (including MD5, which is, it turns out, not a very good cryptographic hash function). When it comes to passwords, there are certain passwords which are used far more frequently than other passwords. All you need to do is take the MD5 of those password and compare it with some hash, and if they match, then it is a pretty reasonable guess that it came from that password.

You may also be interested in reading about one-to-one functions, Injective functions, cryptographic hash functions, MD5, SHA1, and Don't Hash Secrets from the Benlog Security Blog.

Michael Aaron Safyan
+1  A: 

Also, since multiple strings can create the same MD5 hash, due to it being less data than the input string, how would any other hashing system be any better?

An attack is known against MD5 which makes it possible for the attacker to create multiple documents with different contents but the same MD5 hash. This attack is computationally feasible, and as a demonstration, is was used to "predict" the outcome of a presidential election. (The attacker published a hash before the election, then afterward revealed a document with that hash giving the winner's name. But actually the attacker had a document for each candidate, all with the same hash.)

A better system would provide a cryptographic guarantee, that it is computationally intractable to create two distinct documents that hash to the same value. SHA-1 may be such a system.

An even worse system would allow an attack whereby given access to any hash, you could create a document with that hash. The venerable CRC system, which is still used in many hardware systems (think Ethernet), is vulnerable to this attack. Like MD5, it is a hash function in which the output is not reconstructable from the input, but given any output, it is trivial to construct a document with a given CRC-32 or CRC-64 signature. Worse, you can put any text you like in such a document, then get the CRC you want just by adding junk at the end.

It's no coincidence that CRC-32 can be computed very quickly, MD5 takes significantly longer, and SHA-1 takes somewhat longer than that. Both the cost models and the trust models are difficult.

A really good hash function would be as quick to compute as CRC and as difficult to construct two documents hashing to the same value as SHA-1. Don't hold your breath...

Norman Ramsey
Actually CRC-32 is not that fast; on some systems (e.g. ARM), some hash functions (MD4, Panama...) turn out to be faster than CRC-32. CRC-32 is very fast in _hardware_, as a circuit, but when implemented in software, one must use look-up tables and many RISC processors are rather uncomfortable with tables.
Thomas Pornin
A: 

Most of the answers don't hit the real point of the question: the hashing transformations are non linear, and as such are very difficult (but not impossible, given enough computational power and time) to reverse.

Think about the relative difficulty of squaring a number and getting the square root. Add to that that you just have partial information and all the missing bits are important to yield the correct answer (not as in the example of cropping a number).

If after all you're still not sure, then try by yourself to reverse the steps of MD5 or any other cryptographic hash function ;-)

fortran