tags:

views:

201

answers:

6

I read the Wikipedia article about md5 hashes but I still can't understand how a hash can't be "reconstituted" back to the original text.

Could someone explain to someone who knows very little about cryptography how this works? What part of the function makes it one-way?

+17  A: 

Think of a really basic hash - for the input string, return the sum of the ASCII values of each character.

hash( 'abc' ) = ascii('a')+ascii('b')+ascii('c')
              = 97 + 98 + 99
              = 294

Now, given the hash value of 294, can you tell what the original string was? Obviously not, because 'abc' and 'cba' (and countless others) give the same hash value.

Cryptographic hash functions work the same way, except that obviously the algorithm is much more complex. There are always going to be collisions, but if you know string s hashes to h, then it should be very difficult ("computationally infeasible") to construct another string that also hashes to h.

Graeme Perrow
so it seems that all hash functions will have _some_ "collisions"...the goal in designing a good algorithm is keeping collisions as low as possible?
@unknown(google) That is true
molecules
Yes. I've updated my answer.
Graeme Perrow
No, the goal is for the collisions to be evenly distributed and unpredictable. For example, for the basic hash given in this answer, you can easily predict that `hash('acb')` (and a lot of others) will have the same result as `hash('abc')`. For a strong hash, there shouldn't be any way to do that (convert one input into another which has the same hash) that is faster than just hashing random inputs until you find one with the right hash (bruteforce).
caf
Thus, what Graeme has provided is actually what's called a checksum - a hash is a checksum with much more stringent requirements (http://en.wikipedia.org/wiki/Cryptographic_hash_function). Basically, they achieve this by making the algorithm mash together the input in as complex and confusing a way as possible (while still being very fast to compute)
BlueRaja - Danny Pflughoeft
+5  A: 

A hash is a (very) lossy encoding.

To give you a simpler example, imagine a fictitious 2-letter encoding of a 5-letter word called the X-encoding. The algorithm for the X-encoding is simple: take the first and last letters of the word.

So,

X-encode( SAUCE ) = SE
X-encode( BLOCK ) = BK

Clearly, you cannot reconstruct SAUCE from its encoding SE (assuming our range of possible inputs is all 5-letter words). The word could just as easily be SPACE.

As an aside, the fact that SAUCE and SPACE both produce SE as an encoding is called a collision, and you can see that the X-ecoding wouldn't make a very good hash. :)

ezod
s/encoding/one-way-encryption
BalusC
Well, encryption is just a type of encoding...
ezod
+9  A: 

Since everyone until now has simply defined what a hash function was, I will bite.

A one-way function is not just a hash function -- a function that loses information -- but a function f for which, given an image y ("SE" or 294 in existing answers), it is difficult to find a pre-image x such that f(x)=y.

This is why they are called one-way: you can compute an image but you can't find a pre-image for a given image.

None of the ordinary hash function proposed until now in existing answers have this property. None of them are one-way cryptographic hash functions. For instance, given "SE", you can easily pick up the input "SXXXE", an input with the property that X-encode("SXXXE")=SE.

There are no "simple" one-way functions. They have to mix their inputs so well that not only you don't recognize the input at all in the output, but you don't recognize another input either.

SHA-1 and MD5 used to be popular one-way functions but they are both nearly broken (specialist know how to create pre-images for given images, or are nearly able to do so). There is a contest underway to choose a new standard one, which will be named SHA-3.

An obvious approach to invert a one-way function would be to compute many images and keep them in a table associating to each image the pre-image that produced it. To make this impossible in practice, all one-way function have a large output, at least 64 bits but possibly much larger (up to, say, 512 bits).

EDIT: How do most cryptographic hash functions work?

Usually they have at their core a single function that does complicated transformations on a block of bits (a block cipher). The function should be nearly bijective (it shouldn't map too many sequences to the same image, because that would cause weaknesses later) but it doesn't have to be exactly bijective. And this function is iterated a fixed number of times, enough to make the input (or any possible input) impossible to recognize.

Take the example of Skein, one of the strong candidates for the SHA-3 context. Its core function is iterated 72 times. The only number of iterations for which the creators of the function know how to sometimes relate the outputs to some inputs is 25. They say it has a "safety factor" of 2.9.

Pascal Cuoq
+1 for "you should not recognize the input at all in the output"
rossoft
+3  A: 

Here's a very simple example. Assume that I'm a beginning cryptographer and I create a hash function that does the following:

int SimpleHash(file) {
    return 0 if file.length is even;
    return 1 if file.length is odd;
}

Now here's the test. SimpleHash(specialFile) is 0. What was my original file?

Obviously, there's no way to know (although you could likely discover pretty easily that my hash is based on file length). There is no way to "reconstitute" my file based on the hash because the hash doesn't contain everything that my file did.

Kaleb Pederson
You have defined a hash function. A cryptographic hash function is a hash with strong properties that your answer does not mention. It is not enough that the file can't be reconstituted; it has to be difficult for someone **knowing the function** to craft a file that hashes to **0**. In fact, if your function outputs a boolean, it cannot be one-way -- it only need to be tried on to a couple of inputs to be cracked.
Pascal Cuoq
Well, it can be one-way - but it's strength is only 1 bit. Which is clearly bruteforceable with paper and pencil :)
caf
@Pascal You're very correct about good properties of a cryptographic hash function. I intended only to provide an example that was completely obvious that there was no way to get back the original or really even guess at what it might have been. I hope I answered the OP's question.
Kaleb Pederson
+2  A: 

In simple terms, a hash function works by making a big tangled mess of the input data.

See MD5 for instance. It processes input data by 512-bit blocks. Each block is split into 16 32-bit words. There are 64 steps, each step using one of the 16 input words. So each word is used four times within the course of the algorithm. This is where one-wayness comes from: any input bit is input at several places, and between two such inputs the function mixes all the current data together so that each input bit impacts most of the 128-bit running state. This prevents you from inverting the function, or computing a collision, by looking at only a part of the data. You have to look at the whole 128 bits, and the space of 128-bit blocks is too wide to be efficiently walked through.

Now MD5 does not do a good job at it, since collisions for that function can be found. From a cryptographer point of view, MD5 is a rotated encryption function. The processing of one message block M (512 bits) uses an input state V (a 128-bit value) and computes the new state V' as V' = V + E(M, V) where '+' is a word-wise addition, and 'E' happens to be a symmetric encryption function (aka a 'block cipher') which uses M as key and V as the message to be encrypted. From a closer look, E can is a kind of "extended Feistel network", similar to the DES block cipher, with four quarters instead of two halves. Details are not important here; my point is that what makes a "good" hash function, among hash functions which use that structure (called "Merkle-Damgård"), is similar to what makes a block cipher "secure". The successful collision attacks on MD5 use differential cryptanalysis, a tool which was designed to attack block ciphers in the first place.

From a good block cipher to a good hash function, there is a step which is not to be dismissed. With the Merkle-Damgård structure, the hash function is secure if the underlying block cipher is resistant to "related key attacks", a rather obscure property against which block ciphers are rarely strengthened because, for symmetric encryption, related key attacks barely have any practical impact. For instance, the AES encryption turned out not to be as resistant to related key attacks as could be wished for, and this did not trigger general panic. That resistance was not part of the properties which were sought for when AES was designed. It just prevents turning the AES into a hash function. There is a hash function called Whirlpool, which builds on a derivate of Rijndael, "Rijndael" being the initial name of what became the AES; but Whirlpool takes care to modify the parts of Rijndael which are weak to related key attacks.

Also, there are other structures which can be used for building a hash function. The current standard functions (MD5, SHA-1, and the "SHA-2" family, aka SHA-224, SHA-256, SHA-384 and SHA-512) are Merkle-Damgård functions, but many of the would-be successors are not. There is an ongoing competition, organized by the NIST (the US federal organization which deals with that kind of things), to select a new standard hash function, dubbed "SHA-3". See this page for details. Right now, they are down to 14 candidates from an initial 51 (not counting a dozen extra which failed the administrative test of sending a complete submission with code which compiles and runs properly).

Let's now have a more conceptual look. A secure hash function should look like a random oracle: an oracle is a black box which, when given a message M as input, outputs an answer h(M) which is chosen at random, uniformly, in the output space (i.e. all n-bit strings if the hash function output length is n). If given the same message M again as input, the oracle outputs the same value than previously. Apart from that restriction, the output of the oracle on a non previously used input M is unpredictable. One can imagine the oracle as a container for a gnome who throws dice, and carefully records the input messages and corresponding outputs in a big book, so that he will honor his oracle contract. There is no way to predict what the next output will be since the gnome himself does not know that.

If a random oracle exists, then inverting the hash function has cost 2^n: in order to have a given output, there is no better strategy than using distinct input messages until one yields the expected value. Due to the uniform random selection, probability of success is 1/(2^n) at each try, and the average number of requests to the dice-throwing gnome will be 2^n. For collisions (finding two distinct inputs which yields the same hash value), the cost is about *1.4*2^(n/2)* (roughly speaking, with *1.4*2^(n/2)* outputs, we can assemble about 2^n pairs of output, each having a probability of 1/(2^n) of matching, i.e. having two distinct inputs which have the same output). These are the best that can be done with a random oracle.

Therefore, we look for hash functions which are as good as a random oracle: they must mix the input data in such a way that we cannot find a collision more efficiently than what it would cost to simply invoke the function 2^(n/2) times. The bane of hash function is mathematical structure, i.e. shortcuts which allow the attacker to view the hash function internal state (which is big, at least n bits) as a variation on a mathematical object which lives in a much shorter space. 30 years of public research on symmetric encryption systems have produced a whole paraphernalia of notions and tools (diffusion, avalanche, differentials, linearity...) which can be applied. Bottom-line, however, is that we have no proof that a random oracle may actually exist. We want a hash function which cannot be attacked. What we have are hash function candidates, for which no attack is currently known, and, somewhat better, we have some functions for which some kinds of attack can be proven not to work.

There is still some research to be done.

Thomas Pornin
A: 

Shooting for a simple analogy here instead of a complex explanation.

One way hashes are called that because they are not reversible. More typical operations can be reversed.

Addition:

4 + 3 = 7  

can be reversed by taking the sum and subtracting one of the addends

7 - 3 = 4  

Multiplication:

4 * 5 = 20  

can be reversed by taking the product and dividing by one of the factors

20 / 4 = 5

Modulo division:

22 % 7 = 1  

can not be reversed because there is no operation that you can do to the quotient and the dividend to reconstitute the divisor (or vice versa).

One-way hash functions have the same mathematical quality as modulo division. See the other answers for more specifics on the different hash functions.

Kelly French