views:

392

answers:

4

Given two different strings S1 and S2 (S1 != S2) is it possible that:

SHA1(S1) == SHA1(S2)

is True?

  1. If yes - with what probability?
  2. If not - why not?
  3. Is there a upper bound on the length of a input string, for which probably of getting duplicates is 0? OR is the calculation of SHA1 (hence probability of duplicates) independent of the length of the string?

The goal I am trying to achieve is to hash some sensitive ID string (maybe joined together with some other fields like parent ID), so that I can use the hash value as an ID instead (for example in the database).

Example:

Resource ID: X123
Parent ID: P123

I don't want to expose the nature of my resource identifies to allow client to see "X123-P123".

Instead I want to create a new column hash("X123-P123"), let's say it's AAAZZZ. Then the client can request resource with id AAAZZZ and not know about my internal id's etc.

+1  A: 

What you are talking about is called a collision. Here is an article about SHA1 collisions: http://www.rsa.com/rsalabs/node.asp?id=2927

Edit: So another answerer beat me to mentioning the pigeon hole principle LOL, but to clarify this is why it's called the pigeon hole principle, because if you have some holes cut out for carrier pigeons to nest in, but you have more pigeons than holes, then some of the pigeons(an input value) must share a hole(the output value).

AaronLS
+6  A: 

Yes it is possible because of the pigeon hole principle.

Most hashes (also sha1) have a fixed output length, while the input is of arbitrary size. So if you try long enough, you can find them.

However, cryptographic hash functions (like the sha-family, the md-family, etc) are designed to minimize such collisions. The best attack known takes 2^63 attempts to find a collision, so the chance is 2^(-63) which is 0 in practice.

The pigeon hole principle: http://en.wikipedia.org/wiki/Pigeonhole_principle

Henri
I was just going to mention the pigeon hole principle when I edited my answer. LOL I am going to take it back out cause it looks like I copied you.
AaronLS
thanks, how about #3?
drozzy
The fact that the best attack takes 2^63 work doesn't mean that the chance of any two hashes colliding is 1 in 2^63 - all known attacks require constructing two colliding messages, not finding a preimage.
Nick Johnson
+13  A: 

What you describe is called a collision. Collisions necessarily exist, since SHA-1 accepts many more distinct messages as input that it can produce distinct outputs (SHA-1 may eat any string of bits up to 2^64 bits, but outputs only 160 bits; thus, at least one output value must pop up several times). This observation is valid for any function with an output smaller than its input, regardless of whether the function is a "good" hash function or not.

Assuming that SHA-1 behaves like a "random oracle" (a conceptual object which basically returns random values, with the sole restriction that once it has returned output v on input m, it must always thereafter return v on input m), then the probability of collision, for any two distinct strings S1 and S2, should be 2^(-160). Still under the assumption of SHA-1 behaving like a random oracle, if you collect many input strings, then you shall begin to observe collisions after having collected about 2^80 such strings.

(That's 2^80 and not 2^160 because, with 2^80 strings you can make about 2^159 pairs of strings. This is often called the "birthday paradox" because it comes as a surprise to most people when applied to collisions on birthdays. See the Wikipedia page on the subject.)

Now we strongly suspect that SHA-1 does not really behave like a random oracle, because the birthday-paradox approach is the optimal collision searching algorithm for a random oracle. Yet there is a published attack which should find a collision in about 2^63 steps, hence 2^17 = 131072 times faster than the birthday-paradox algorithm. Such an attack should not be doable on a true random oracle. Mind you, this attack has not been actually completed, it remains theoretical (some people tried but apparently could not find enough CPU power). Yet, the theory looks sound and it really seems that SHA-1 is not a random oracle. Correspondingly, as for the probability of collision, well, all bets are off.

As for your third question: for a function with a n-bit output, then there necessarily are collisions if you can input more than 2^n distinct messages, i.e. if the maximum input message length is greater than n. With a bound m lower than n, the answer is not as easy. If the function behaves as a random oracle, then the probability of the existence of a collision lowers with m, and not linearly, rather with a steep cutoff around m=n/2. This is the same analysis than the birthday paradox. With SHA-1, this means that if m < 80 then chances are that there is no collision, while m > 80 makes the existence of at least one collision very probable (with m > 160 this becomes a certainty).

Note that there is a difference between "there exists a collision" and "you find a collision". Even when a collision must exist, you still have your 2^(-160) probability every time you try. What the previous paragraph means is that such a probability is rather meaningless if you cannot (conceptually) try 2^160 pairs of strings, e.g. because you restrict yourself to strings of less than 80 bits.

Thomas Pornin
So... what you are saying is that for strings with length more than 80 bits the probably of at least one collision exists?What if I take 40 bit strings, or 20 bit strings? I know it is silly - but is there a function that will ensure that there are NO duplicates?I guess this is going to be just a regular mapping function, instead of a hash (i.e. substitute digit 1 with A etc..) - which makes my use of SHA1 kind of pointless for this purpose.
drozzy
What purpose? You haven't stated a purpose. There are cryptographic primitives that are permutations, such as block ciphers in ECB mode and RSA.
GregS
Sorry - added goal (purpose) to the question. basically hiding sensitive information.
drozzy
@drozzy: if you take 40-bit input strings, then there is a very low probability that two of them hash to the same value. But you cannot know for sure without trying them exhaustively (with 40-bit strings, this is doable, with a big PC and a few days or weeks). Of course, if you have no collision, then what you really have is a mapping from all your strings to distinct 160-bit values. For "hiding sensitive information", the keyword is "confidentiality" and you should research encryption instead of hashing.
Thomas Pornin
Added new question:http://stackoverflow.com/questions/2484068/simplest-method-of-hiding-sensitive-information
drozzy
I find it helps to realize that the odds of winning most national public lotteries is astoundingly small, yet the prize is often paid out each week or within a few weeks, so no matter how small the odds, they will occur. 2^-x may be a very small number, but there in no certainty either way, that collision will or won't occur. Or to be exact: 2^-x is neither one nor zero. Similarly, although the chances are slight, there is nothing to stop hash(1) matching hash(2) no matter what you do to constrain it.
andora
+1  A: 

A collision is almost always possible in a hashing function. SHA1, to date, has been pretty secure in generating unpredictable collisions. The danger is when collisions can be predicted, it's not necessary to know the original hash input to generate the same hash output.

For example, attacks against MD5 have been made against SSL server certificate signing last year, as exampled on the Security Now podcast episode 179. This allowed sophisticated attackers to generate a fake SSL server cert for a rogue web site and appear to be the reaol thing. For this reason, it is highly recommended to avoid purchasing MD5-signed certs.

spoulson