tags:

views:

2910

answers:

11

I have some code on my PHP powered site that creates a random hash (using sha1()) and I use it to match records in the database. This is to obfuscate things like id's in query strings which can (and usually are) sequential, and I don't want anyone changing it in the query string to get other information.

What are the chances of a collision? Should I generate the hash, then check first if it's in the database (I'd rather avoid an extra query) or automatically insert it, based on the probability that it probably won't collide with another.

Thank you!

+7  A: 

If you assume that SHA-1 does a good job, you can conclude that there's a 1 in 2^160 chance that two given messages have the same hash (since SHA-1 produces a 160-bit hash).

2^160 is a ridiculously large number. It's roughly 10^48. Even if you have a million entries in your database, that's still a 1 in 10^42 chance that a new entry will share the same hash.

SHA-1 has proved to be fairly good, so I don't think you need to worry about collisions at all.

As a side note, use PHP's *raw_output* feature when you use SHA-1 as this will lead to a shorter string and hence will make your database operations a bit faster.

EDIT: To address the birthday paradox, a database with 10^18 (a million million million) entries has a chance of about 1 in 0.0000000000003 of a collision. Really not worth worrying about.

Artelius
For all those that really believe in collision freedom, please remember the Birthday effect. Your first collision is randomly more likely to occur than you might imagine. So be careful anyways
Robert Gould
Yeah but one collision won't be the thing killing your system. Your own bug will. I don't think something that happen randomly once in a decade, except in a nuclear factory, would be something we should worry about. If I could have only that kind of annoyance ... ;-)
e-satis
But when there is a clash. Can you imagine trying to debug the problem!
Martin York
The first collision has a 50% chance of happening after the first 2^80 hashes.
Seun Osewa
@Seun: No, that's completely wrong. Read about the birthday paradox.
Artelius
+4  A: 

SHA-1 produces 160 bit long digest. Therefore you are safe as long as you have less than 2^(160/2) entries. Division by 2 is due to birthday paradox.

Szere Dyeri
"Safe" is surely a relative term. It's not that it's "safe" up to some point and "unsafe" afterwards. It only makes sense to talk about probabilities of collisions at certain points. The OP may need a "one in a million chance or better" or he may need "one in a billion".
Jon Skeet
@Szere Dyeri Remember randomness is not predictable :)
Robert Gould
Jon, you are right. To be more precise, expected number of N-bit hashes that can be generated before a collision is 2^(N/2), where expectation is a formal first order statistics of the distribution.
Szere Dyeri
+2  A: 

From first principles:

SHA-1 produces a 160-bit digest. Assuming it uses the entire bit-space evenly (which is presumably what it was designed to do), that is only a 2^-160 chance on each insert that you would get a collision.

So for each insert, it should be safe to assume there is no collision, and deal with the error if there is.

That is not to say that you can ignore the chance of collision entirely.

The Birthday Paradox suggests the chance of there being at least one collision in your database is higher than you would guess, because of the O(N^2) possible collisions.

Oddthinking
The Birthday Paradox brings the chance of a collision up to 0.00000000000000000017347234759768070944119244813919%. Really not worth worrying about.
Jeff Hubbard
Jeff, I concede that the collision risk can be ignored in almost all cases. I didn't do the maths before.However, you fail to mention how many objects there are in the collection, so your estimation of the chance of collision is kind of meaningless.
Oddthinking
+3  A: 

If you use numerically increasing IDs as input, then chances are practically zero that SHA-1 will collide.

If the ID is the only input, then SHA-1 seems to be quite some overkill - producing a 160 bit hash from a 32-bit integer. I would rather use modular exponentiation, e.g. chose a large (32-bit) prime p, compute the modular generator g of that group, and then use g^id. This will be guaranteed collision free, and only give 32-bit "hashes".

Martin v. Löwis
The id is not the only input. There is some specific data and time() rand() to mix things up a bit.
alex
Then just generating 160 random bits will be sufficiently unique - no need to generate a hash of anything (it won't become more unique through the hash, nor will it become more random).
Martin v. Löwis
+1  A: 

Ask the question what will it cost you if there is a collision. If this is a free site fine. If you are running a money making business and an overrite will cost you a million dollar contract then I would think again.

I think you are going about this the wrong way.
I think you need to keep the unique ID but you want to make sure that the users can not manually change the ID.

One way to to do this is to put the ID and the hash of the ID (with some extra data) in the link.

For Example: (my PHP is rusty so general algorithm would be:)

id   = 5;
hash = hash("My Private String " + id)
link = "http://mySite.com/resource?id=" + id + "&hash=" + hash

Then when you receive a request just validate that you can regenerate the hash from ID. This does leave you open to an attack to work out "My Private String" but that will be quite computationally hard and you could always append something else unique that is not directly available to the user (like the session ID).

Martin York
A: 

The other comments have covered you on the probabilities, however if you look at this pragmatically then you can get a definite answer for yourself.

You said yourself that you are going to be hashing your sequential IDs. It would be easy to code up a test case. Iterate through ~100,000,000 ids and check for collisions. That would not take long to do. On the other hand you might run out of memory quarter of the way through.

Josh
+9  A: 

Use a symmetric encryption scheme and a private server key to encrypt the ID (and other values) when you send them to the client and decrypt again on reception. Take care that your cryptographic function provides both confidentiality and integrity checks.

This allows you to use sensible values when talking to the DB without any collision, great security when talking to the client and reduces your probability to land on thedailyWTF by approximatly 2^160.

See also Pounding A Nail: Old Shoe or Glass Bottle?!

David Schmitt
+1 for the sanity check
Robert Gould
+1 for expressing what others are too cowerd to tell.
CDR
+5  A: 

why not do something which guarantees there'll be no collisions, as well as makes sure that no one can change a GET parameter to view something they shouldn't: using a salt, combine the id and its hash.

$salt = "salty";
$key = sha1($salt . $id) . "-" . $id;
// 0c9ab85f8f9670a5ef2ac76beae296f47427a60a-5

even if you do accidentally stumble upon two numbers which have the exact same sha1 hash (with your salt), then the $key will still be different and you'll avoid all collisions.

nickf
preferably use HMAC (hash_hmac in PHP) which, we're told, addresses some weaknesses with this kind of simple scheme. http://en.wikipedia.org/wiki/HMAC
araqnid
+1 no reason to use an "extremely low chance of collision" when there is such a solution to avoid them at all at pratically the same cost
Lo'oris
A: 

I don't think sha1() is going to give you any trouble here, weak random number generation is a more likely candidate for collisions.

Stefan Esser wrote good article on the topic.

Waquo
+1  A: 

If you have to obfuscate some data in your url to hide data, you are doing something wrong.

Arkh
Why? imagine a scenario where you are selling digital goods and these goods can be accessed from an API. some are priced and some arent. This is the best way to refer to them via the url without the user changing the urls and getting other "non-authorized" applications to download.
Faisal Abid
Or you could implement access levels and check if people have access to your data before sending it to them blindly. Yeah, you need to put some work to do it, but you're paid for that, not to implement security by obscurity which has already failed enough.Never trust data coming from the user.
Arkh
I'm inclined to agree with this. Though there are cases where hashing data and being concerned about its uniqueness is important (Mercurial IDs come to mind), if you _have to_ hide your id's for security reasons, that's a very dangerous security model. And if you don't need to do so, why bother?
dimo414
A: 

Hash(UUID + Hash(UUID));

Faisal Abid
Why a -1, An explaination would be usefull
Faisal Abid
nonsense: if you still hash, you still can have collisions.
Lo'oris
could even *raise* the probability of a collision
ZJR