views:

212

answers:

3

Theoretically does hashing a unique value yield a unique value?

Let's say I have a DB table with 2 columns: id and code. id is an auto-incrementing int and code is a varchar. If I do ...

$code = sha1($id);

... and then store $code into the same row as $id. Will my code column be unique as well?

What about if I append the current time? eg:

$code = sha1($id . time());

Thanks.

A: 

It depends on the hashing algorithm. But theoretically, unless the hash is exactly the same as the original string there is a potential for the hash to not be unique.

A hash of a value is a condensed representation of the original value. By removing pieces of information to create the hash you are losing parts of what make it unique in the domain and therefore increasing the probability that the value will not be unique. The only way to guarantee that it will be unique is to use the original value itself which defeats the purpose of hashing.

Andrew Hare
"The only way to guarantee that it will be unique is to use the original value itself" - clearly not true!
Martin Smith
@Martin - Please explain what you mean, _why_ is my statement untrue?
Andrew Hare
The hash would clearly not have to be the original string for uniqueness to be guaranteed. It could be the original string ROT13 ed for example and still be unique.
Martin Smith
c.f. Perfect Hash Functions http://en.wikipedia.org/wiki/Perfect_hash_function
Martin Smith
A: 

There is a small possibility that two different values give the same hash. Although very small, it's not unlikely.

Ikke
+5  A: 

In general, the answer is no. This is trivial to show: SHA-1 has 2^160 different outputs - 160 bits, but there are many more inputs that that (e.g., there are 2^320 different 40-byte strings, and they can't all map to a unique output).

Given a sufficient subset of values, the answer is maybe. It depends on the exact algorithm and the size of the subset: if the number of possible inputs is smaller than the number of possible outputs, then it is possible (but NOT guaranteed). When thinking about this, it may be helpful to keep the birthday paradox in mind: the probability of a collision does not increase linearly with the number of inputs.

Michael Madsen
Thanks. So the only way to get unique values is to generate and scan through the DB to check if it exists (repeat if yes)? That's pretty much what I was trying to avoid doing here, but I guess it's the only way.
Nebs
Unfortunately, there's no other way if you want to guarantee a unique value. This is also why you can't easily reverse a hash: I can tell you that the SHA-1 hash for "1" is "356a192b7913b04c54574d18c28d46e6395428ab", but there are many other values that will generate that hash.
Michael Madsen
I see. The thing is I probably won't need to generate more than say 1000 unique values. Would it be safe to say that all the values will be unique in this case?
Nebs
No. Even with just two values, there is still a 1 in 2^160 chance that they hash to the same value. That may seem infinitely small, but it's still bigger than 0, which means that, although it is *likely* that your 1000 values all hash to different values, it cannot be guaranteed without testing. If you know the possible values in advance, you can calculate the hash for all of them, put them in a list, and sort it by the hash - that will make it easy to find duplicates.
Michael Madsen
You're right, thanks. I ended up going with uniqid as per RenderIn's suggestion.
Nebs