If I'm hashing size-constrained similar data (social security numbers, for example) using a hash algorithm with a larger byte size than the data (sha-256, for example), will the hash guarantee the same level of uniqueness as the original data?
If you're using a cryptographic hash like SHA, then the short answer is yes.
You can always create a customized hash that guarantees uniqueness. For data in a known domain (like SSNs), the exercise is relatively simple.
If your target hash value actually has more bits available than what you're hashing, the hash simply maps each input value to one of the available output values. The simplest such mapping treats the input as a multi-byte integer and uses it directly as the output.
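As an illustration, here is a minimal sketch in Python of such a mapping for SSNs; the function name is hypothetical, and the "hash" is simply the identity mapping on the nine digits, which is injective and therefore collision-free:

```python
# A minimal sketch of a perfect hash over a known domain: an SSN is nine
# decimal digits, so interpreting it as an integer already maps each SSN
# to a distinct value in a 30-bit space (10^9 < 2^30).

def ssn_perfect_hash(ssn: str) -> int:  # hypothetical helper name
    """Map a formatted SSN like '123-45-6789' to a unique integer."""
    digits = ssn.replace("-", "")
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError("expected a nine-digit SSN")
    return int(digits)  # distinct SSNs always yield distinct integers

assert ssn_perfect_hash("123-45-6789") != ssn_perfect_hash("123-45-6788")
```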
When your target hash value has fewer bits than what's being hashed, uniqueness can never be guaranteed: by the pigeonhole principle, at least two inputs must map to the same output.
One key feature of a cryptographically secure hash function is that you are safe from collisions beyond reasonable doubt, regardless of the input. This also holds for inputs shorter than the output size, which are no different from longer messages with little entropy. So you can use SHA-2 without worrying about collisions.
The probability of a hash collision has nothing to do with the size of the input string (except insofar as it determines how many inputs you need to keep unique). It's possible to have a collision even when hashing just 0 and 1 with an ideal hash algorithm, although the probability is 1/2^(bit length), which in the case of SHA-256 is effectively zero.
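As a quick sanity check, here is a small Python snippet using the standard hashlib module; it hashes those two inputs and shows how small the per-pair collision probability of an ideal 256-bit hash is:

```python
# Even minimal inputs like "0" and "1" get distinct SHA-256 digests.
import hashlib

d0 = hashlib.sha256(b"0").hexdigest()
d1 = hashlib.sha256(b"1").hexdigest()
assert d0 != d1
print(d0)
print(d1)

# For an ideal hash, the chance that two fixed, distinct inputs collide:
print(2.0 ** -256)  # ~8.6e-78, effectively zero
```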
Hash collisions are a birthday paradox problem. In the case of a 256-bit hash, the probability of at least one collision depends purely on the number of inputs, n, and is:
- 1 - (2^256)! / ((2^256)^n * (2^256 - n)!) -- or, as others have said, basically zero for any reasonable number of inputs.
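To put numbers on this, here is a minimal sketch using the standard approximation 1 - exp(-n(n-1)/(2N)) to the factorial formula above, where N = 2^256; the function name is made up for the example:

```python
# Birthday-bound estimate: P(collision) ~= 1 - exp(-n*(n-1) / (2*N))
# for n inputs hashed into N = 2^bits possible digests.
import math

def collision_probability(n: int, bits: int = 256) -> float:  # hypothetical name
    N = 2 ** bits
    # expm1 keeps precision when the exponent is vanishingly small
    return -math.expm1(-n * (n - 1) / (2 * N))

# Even hashing every nine-digit SSN (a billion inputs):
print(collision_probability(10 ** 9))  # ~4.3e-60, effectively zero
```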
Others have pointed out that collisions should not be a concern; that is the whole point of cryptographically secure hash functions. I would just like to add the following:
- If your input set is small enough (e.g. data is SSNs -- there are fewer than a billion of them), then the absence of collisions is amenable to verification: just test it exhaustively (see the sketch after this list).
- If the input set is too big to be exhaustively scanned, then it is expected that the absence of collisions cannot be proven. Good hash functions are expected to act as random oracles, and on a random oracle you cannot prove such a property without trying exhaustively. Being able to prove the absence of collisions would look suspiciously like a weakness of the function.
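To make the first point concrete, here is a minimal sketch of such an exhaustive test; the function name is made up, and the demo runs over a small slice of the SSN space rather than all 10^9 values (the full space works the same way but needs far more time and memory to hold the digests):

```python
# Exhaustively verify that no two values in a known domain share a
# SHA-256 digest. Finding a collision here would be a major break.
import hashlib

def verify_no_collisions(domain) -> bool:  # hypothetical helper name
    seen = set()
    for value in domain:
        digest = hashlib.sha256(value.encode()).digest()
        if digest in seen:
            return False
        seen.add(digest)
    return True

# Demonstrate on a small slice of the SSN space:
print(verify_no_collisions(f"{n:09d}" for n in range(1_000_000)))  # True
```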