views:

241

answers:

5

If I'm hashing size-constrained similar data (social security numbers, for example) using a hash algorithm with a larger byte size than the data (sha-256, for example), will the hash guarantee the same level of uniqueness as the original data?

A: 

If you're using a cryptographic hash like SHA, then the short answer is yes.

Die in Sente
Thanks. I was thinking so, but I couldn't find a reference to back it up and I'm not smart enough to dig into the math and conclude one way or the other!
matt
As noted above, a cryptographic hash just says that collisions are extraordinarily unlikely, not impossible.
Novelocrat
@Novelcrat, The *short* answer to the original question is yes. While in theory a collision is possible, the mean time to finding a collision is considerably longer than the time it will take the sun to evolve into a red giant and destroy the earth.
Die in Sente
@Novelcrat. P.S. If you can post two 10-digit SSNs that produce the same SHA-256 hashes, I will pay you $1,000 US.
Die in Sente
+3  A: 

You can always create a customized hash that guarantees uniqueness. For data in a known domain (like SSN's), the exercise is relatively simple.

If your target hash value actually has more bits available than what you're hashing, the hash simply maps input values to one of the available output values. This will be a simple linear mapping from input value as a multi-byte integer to the output as a multi-byte integer.

When your target hash value has fewer bits than what's being hashed, then uniqueness cannot ever be guaranteed.

S.Lott
Thanks. I am considering hashing ssn's and an "account" identifier which may vary with each implementation. So if I can use a hash function instead of a pre-generated, it would be preferable.
matt
If masking social security numbers is the goal, then implementing a one to one linear mapping function would not suffice, as it would be rather easy to calculate the original input from some samples of the output. Besides, the length of the input string definitely does not affect the effectiveness of a cryptographically secure hash function, so using a known hash algorithm is the way to go
Silvio Donnini
+1  A: 

One key feature of a cryptographically secure hash function is that you are safe from collisions beyond reasonable doubt, regardless of the input. This is also valid for input shorter than the output's size, which is the same of a longer message with little entropy. So you can use SHA-2 without worrying about collisions.

Silvio Donnini
A: 

The probability of a hash collision has nothing to do with the size of the input string (except to the extent that it indicates how many inputs you need to keep uniqueness among). It's possible to have a hash collision when you hash 0 and 1 using a perfect hash algorithm, although the possibility is 1/(2^bit-length). Which in the case of SHA-256 is effectively zero.

Hash collisions are a birthday paradox problem. In the case of a 256 bit hash, the probability of a collision among two inputs is purely dependent on the count of inputs and is:

  • 1 - (2^256)! / ((2^256^inputcount) * (2^256-inputcount)!) or as others have said -- basically zero for reasonable numbers of inputs.
Michael Mullany
True. I'm not questioning the security implications, though. I'm asking for probability of uniqueness from a hash when the size of the data is less than the size of the hash. (I need the resulting value to be deterministic/repeatable, so performing a random salt of x bytes doesn't work for me. I might "salt" by adding constant characters per implementation - for instance, I might append characters like '593jra' to the ssn before hashing).
matt
Isn't the birthday paradox based on the pigeonhole principle? If so, in theory I don't have a pigeonhole scenario.
matt
The pigeonhole principle is the simple notion that when you have more items than pigeonholes, you're guaranteed to have a collision. The birthday paradox just says you're really really likely to get a collision if your ratio of items to pigeonholes is "high". Where "high" is defined by the above formula.
Michael Mullany
+1  A: 

Others have pointed out that collisions should not be a concern; that is the whole point of cryptographically secure hash functions. I would just like to add the following:

  • If your input set is small enough (e.g. data is SSN -- there are less than a billion of them), then the absence of collision is amenable to verification: just test it exhaustively.
  • If the input set is too big to be exhaustively scanned, then it is expected that the absence of collision cannot be proven. Good hash functions are expected to act as random oracles, and on a random oracle you cannot prove such a property without trying exhaustively. Being able to prove the absence of collision would suspiciously look like a weakness of the function.
Thomas Pornin