If I'm hashing size-constrained similar data (social security numbers, for example) using a hash algorithm with a larger byte size than the data (sha-256, for example), will the hash guarantee the same level of uniqueness as the original data?
If you're using a cryptographic hash like SHA, then the short answer is yes.
You can always create a customized hash that guarantees uniqueness. For data in a known domain (like SSNs), the exercise is relatively simple.
If your target hash value actually has more bits available than what you're hashing, the hash simply maps each input value to one of the available output values. The simplest such mapping treats the input as a multi-byte integer and uses it directly as the output.
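As an illustration, here is a minimal sketch in Python of such a mapping for SSNs; the function name is hypothetical, and the "hash" is simply the identity mapping on the nine digits, which is injective and therefore collision-free:

```python
# A minimal sketch of a perfect hash over a known domain: an SSN is nine
# decimal digits, so interpreting it as an integer already maps each SSN
# to a distinct value in a 30-bit space (10^9 < 2^30).

def ssn_perfect_hash(ssn: str) -> int:  # hypothetical helper name
    """Map a formatted SSN like '123-45-6789' to a unique integer."""
    digits = ssn.replace("-", "")
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError("expected a nine-digit SSN")
    return int(digits)  # distinct SSNs always yield distinct integers

assert ssn_perfect_hash("123-45-6789") != ssn_perfect_hash("123-45-6788")
```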
When your target hash value has fewer bits than what's being hashed, uniqueness can never be guaranteed: by the pigeonhole principle, at least two inputs must map to the same output.
One key feature of a cryptographically secure hash function is that you are safe from collisions beyond reasonable doubt, regardless of the input. This also holds for inputs shorter than the output size, which are no different from longer messages with little entropy. So you can use SHA-2 without worrying about collisions.
The probability of a hash collision has nothing to do with the size of the input string (except insofar as it determines how many inputs you need to keep unique). It's possible to have a collision even when hashing just 0 and 1 with an ideal hash algorithm, although the probability is 1/2^(bit length), which in the case of SHA-256 is effectively zero.
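As a quick sanity check, here is a small Python snippet using the standard hashlib module; it hashes those two inputs and shows how small the per-pair collision probability of an ideal 256-bit hash is:

```python
# Even minimal inputs like "0" and "1" get distinct SHA-256 digests.
import hashlib

d0 = hashlib.sha256(b"0").hexdigest()
d1 = hashlib.sha256(b"1").hexdigest()
assert d0 != d1
print(d0)
print(d1)

# For an ideal hash, the chance that two fixed, distinct inputs collide:
print(2.0 ** -256)  # ~8.6e-78, effectively zero
```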
Hash collisions are a birthday paradox problem. In the case of a 256-bit hash, the probability of at least one collision depends purely on the number of inputs, n, and is:
- 1 - (2^256)! / ((2^256)^n * (2^256 - n)!) -- or, as others have said, basically zero for any reasonable number of inputs.
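To put numbers on this, here is a minimal sketch using the standard approximation 1 - exp(-n(n-1)/(2N)) to the factorial formula above, where N = 2^256; the function name is made up for the example:

```python
# Birthday-bound estimate: P(collision) ~= 1 - exp(-n*(n-1) / (2*N))
# for n inputs hashed into N = 2^bits possible digests.
import math

def collision_probability(n: int, bits: int = 256) -> float:  # hypothetical name
    N = 2 ** bits
    # expm1 keeps precision when the exponent is vanishingly small
    return -math.expm1(-n * (n - 1) / (2 * N))

# Even hashing every nine-digit SSN (a billion inputs):
print(collision_probability(10 ** 9))  # ~4.3e-60, effectively zero
```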
Others have pointed out that collisions should not be a concern; that is the whole point of cryptographically secure hash functions. I would just like to add the following:
- If your input set is small enough (e.g. data is SSNs -- there are fewer than a billion of them), then the absence of collisions is amenable to verification: just test it exhaustively (see the sketch after this list).
- If the input set is too big to be exhaustively scanned, then it is expected that the absence of collisions cannot be proven. Good hash functions are expected to act as random oracles, and on a random oracle you cannot prove such a property without trying exhaustively. Being able to prove the absence of collisions would look suspiciously like a weakness of the function.
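To make the first point concrete, here is a minimal sketch of such an exhaustive test; the function name is made up, and the demo runs over a small slice of the SSN space rather than all 10^9 values (the full space works the same way but needs far more time and memory to hold the digests):

```python
# Exhaustively verify that no two values in a known domain share a
# SHA-256 digest. Finding a collision here would be a major break.
import hashlib

def verify_no_collisions(domain) -> bool:  # hypothetical helper name
    seen = set()
    for value in domain:
        digest = hashlib.sha256(value.encode()).digest()
        if digest in seen:
            return False
        seen.add(digest)
    return True

# Demonstrate on a small slice of the SSN space:
print(verify_no_collisions(f"{n:09d}" for n in range(1_000_000)))  # True
```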