views:

436

answers:

6

I was reading this question on MD5 hash values and the accepted answer confuses me. One of the main properties, as I understand it, of a cryptopgraphic hash function is that it is infeasible to find two different messages (inputs) with the same hash value.

Yet the consensus answer to the question "why aren't MD5 hash values reversible?" is because "an infinite number of input strings will generate the same output." This seems completely contradictory to me.

Also, what perplexes me somewhat is the fact that the algorithms are public yet the hash values are still irreversible. Is this because there is always data loss in a hash function so there's no way to tell which data was thrown away?

What happens when the input data size is smaller than the fixed output data size (e.g., hashing a password "abc")?

Thanks!

EDIT:

OK, let me see if I have this straight:

  1. It is really, really hard to infer the input from the hash because there are an infinite amount of input strings that will generate the same output (irreversible property).
  2. However, finding even a single instance of multiple input strings that generate the same output is also really, really hard (collision resistant property).
+10  A: 

1: The primary purpose of a hash is to map a very, very large space to a smaller but still very large space (e.g., MD5, which will take 'anything' and convert it into a space of size 2^128 -- big, but not nearly as big as aleph-0.)

Good hashes fill the destination space homogeneously. Bad hashes fill the space in a clumpy fashion, coming up with the same hash for many common inputs.

Imagine the idiotic hash function sum(), which just adds all the digits of the input number: it succeeds in mapping down, but there are a bunch of collisions (inputs with the same output, like 3 and 12 and 21) at the low end of the output space and the upper end of the space is nearly empty. As a result it makes very poor use of the space, is easy to crack, etc.

So a good hash that makes even use of the destination space will make it difficult to find two inputs with the same output, just by the odds: if MD5 were perfect, the odds that two inputs would have the same output would be 2^-128. That's pretty decent odds: the best you can do without resorting to a larger output space. (In truth MD5 isn't perfect, which is one of the things that makes it vulnerable.)

But it will still be true that a huge number of inputs will map to any given hash, because the input space is 'infinite', and dividing infinity by 2^128 still gives you infinity.

2: Yes, hashes always cause data loss, except in the case where your output space is the same as, or larger than, your input space -- and in that case you probably didn't need to hash!

3: For smaller inputs, best practice is to salt the input. Actually, that's good practice for any cryptographic hashing, because otherwise an attacker can feed you specific inputs and try to figure out which hash you are using. 'Salt' is just a set of additional information that you append (or prepend) to your input; you then hash the result.

Alex Feinman
-1 You missed the point that it should be computationally hard to reverse the hash function. A linear function may distribute the hash values very well and still be unsuitable for cryptography.
starblue
I'm sorry, I didn't mean to imply that the function linearly distributes the function, just that the distribution of numbers should be smooth at the large scale.
Alex Feinman
+1 even though some details are missing, I think this answer is still useful.
laalto
A: 

Yet the consensus answer to the question "why aren't MD5 hash values reversible?" is because "an infinite number of input strings will generate the same output."

This is true for any hash function, but it is not the essence of a cryptographic hash function.

For short input strings such as passwords it is theoretically possible to reverse a cryptographic hash function, but it ought to be computationally infeasible. I.e. your computation would run too long to be useful.

The reason for this infeasibility is that the input is so thoroughly "mixed together" in the hash value that it becomes impossible to disentangle it with any less effort than the brute force attack of computing the hash value for all inputs

starblue
A: 

"why aren't MD5 hash values reversible?" is because "an infinite number of input strings >will generate the same output"

this is the reason that it isn't possible to reverse the hash function (get the same input). cryptographic hash functions are collision resistant, that means that it's also hard to find another input value that maps to the same output (if your hash function was mod 2 : 134 mod 2 = 0; now you can't get the 134 back from the result, but we can stil find number 2 with the same output value (134 and 2 collide)).

When the input is smaller than the block size, padding is used to fit it to the block size.

cube
It still doesn't make sense it is hard to find two inputs that produce the same output yet the fact that many inputs have the same output is the reason the hash is irreversible. How is that not a contradiction?
vg1890
reversing the function is something different than finding a collision. Ideally the only way of finding the collision would be trying one input after another and comparing the ouptut of the hash function to the value you want to reverse/find collision of (that is hard). But even if you did that, you wouldn't know if the collision you found was the original one or if you just found a new string with the same hash value.
cube
+1  A: 

These are the properties of hash functions in general.

A word of caution though, MD5 shouldn't be used anymore because of vulnerabilities that have been found in it. Check the 'Vulnerabilities' section and external links detailing these attacks. http://en.wikipedia.org/wiki/Md5 You can make an MD5 collision by changing only 128 bits in a message.

SHA-1 is safe for simple hashing although there are some attacks that would make it weaker against well-funded entities (Governments, large corporations)

SHA-256 is a safe starting point against technology for the next couple decades.

Not necessarily. The accepted answer in the question that I linked to uses an example of a hash function H(x) = x mod 2. This hash function exhibits the hard-to-reverse property but not the low collision property.
vg1890
+8  A: 

Warning: Long answer

I think all of these answers are missing a very important property of cryptographic hash functions: Not only is it impossible to compute the original message that was hashed to get a given hash, it's impossible to compute any message that would hash to a given hash value. This is called preimage resistance.

(By "impossible" - I mean that no one knows how to do it in less time than it takes to guess every possible message until you guess the one that was hashed into your hash.)

(Despite popular belief in the insecurity of MD5, MD5 is still preimage resistant. Anyone who doesn't believe me is free to give me anything that hashes to 2aaddf751bff2121cc51dc709e866f19. What MD5 doesn't have is collision resistance, which is something else entirely.)

Now, if the only reason you can't "work backwards" in a cryptographic hash function was because the hash function discards data to create the hash, then it would not guarantee preimage resistance: You can still "work backwards", and just insert random data wherever the hash function discards data, and while you wouldn't come up with the original message, you'd still come up with a message that hashes to the desired hash value. But you can't.

So the question becomes: Why not? (Or, in other words, how do you make a function preimage resistant?)

The answer is that cryptographic hash functions simulate chaotic systems. They take your message, break it into blocks, mix those blocks around, have some of the blocks interact with each other, mix those blocks around, and repeat that a lot of times (well, one cryptographic hash function does that; others have their own methods). Since the blocks interact with each other, block C not only has to interact with block D to produce block A, but it has to interact with block E to produce block B. Now, sure, you can find values of blocks C, D, E that would produce the blocks A and B in your hash value, but as you go further back, suddenly you need a block F that interacts with C to make D, and with E to make B, and no such block can do both at the same time! You must have guessed wrong values for C, D, and E.

While not all cryptographic hash functions are exactly as described above with block interaction, they have the same idea: That if you try to "work backwards", you're going to end up with a whole lot of dead ends, and the time it takes for you to try enough values to generate a preimage is on the order of hundreds to millions of years (depending on the hash function), not much better than the time it would take just to try messages until you find one that works.

Zarel
+4  A: 

You may be confused, because the answer to the question you cite is confusing. One of the requirements for a cryptographic hash function is that it should be preimage resistant. That is, if you know MD5(x) but not the message x, then it is difficult to find any x' (either equal x or different from x) such that MD5(x') = MD5(x).

Being preimage resistant is a different property than being reversible. A function is reversible if given f(x) it is possible to find x. For example define f(x) = x mod 10. Then f is not reversible. From f(x) = 7 you can't determine whether x was 17, 27 or something else. But f is not preimage resistant, since values x' such that f(x) = 7 are easy to find. x' = 17, 27, 12341237 etc all work.

When doing crypto you usually need functions that are preimage resistant (and other properties such as collision resistance), not just something that is not reversible.

Accipitridae