I am making a pastebin-type site and am trying to make the ID a random string, like paste.com/4RT65L.

I compute the SHA-1 of the ID before I add it to the database, but I keep only the first 8 characters of the hash. Is there a possibility of two IDs producing the same 8-character substring? I don't want a second paste to accidentally get an ID that has already been used.

+1  A: 

Well, the odds of a collision in the first 8 characters are significantly higher than the odds of a collision between two full SHA-1 hashes, but that doesn't mean it is likely to happen.

I would recommend that you do some testing on it. Generate random input and see how long it takes before you have a collision. If you like the results, then go with it. Otherwise, you'll need a longer string.
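The test described above could be sketched roughly like this. It is written in Python for illustration rather than in the asker's PHP, and the names `short_id` and `hashes_until_collision` are made up for this example. It hashes random 16-byte inputs and stops at the first repeated 8-character prefix.

```python
import hashlib
import secrets


def short_id(data: bytes, length: int = 8) -> str:
    """Return the first `length` hex digits of the SHA-1 of `data`."""
    return hashlib.sha1(data).hexdigest()[:length]


def hashes_until_collision(length: int = 8, max_tries: int = 1_000_000):
    """Hash random inputs until two of them share a truncated ID.

    Returns how many hashes had been generated when the first collision
    appeared, or None if max_tries was reached without one.
    """
    seen = set()
    for count in range(1, max_tries + 1):
        candidate = short_id(secrets.token_bytes(16), length)
        if candidate in seen:
            return count
        seen.add(candidate)
    return None


if __name__ == "__main__":
    print(hashes_until_collision())
```

A single run says little because the result varies a lot from run to run, so it is worth repeating the experiment several times before drawing conclusions.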

EDIT: You can also calculate the odds of a collision by looking at the Birthday Paradox.

Basically, if you are taking the first 8 hex digits from the SHA-1, then you have 16**8 (4,294,967,296) different available combinations.

Using an online Birthday Paradox calculator, after about 9,300 hashes, you will have a 1% chance of a collision. It will take about 30,000 hashes before you have a 10% chance, and 77,000 before you have a 50% chance.
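If you would rather not rely on an online calculator, the same figures can be approximated with the standard birthday-problem formula p ≈ 1 - exp(-n(n-1) / 2N), where N is the number of possible IDs. A minimal sketch, in Python for illustration, with function names invented for this example:

```python
import math

SPACE = 16 ** 8  # number of distinct 8-hex-digit IDs (4,294,967,296)


def collision_probability(n: int, space: int = SPACE) -> float:
    """Approximate chance that n uniformly random IDs contain a duplicate."""
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * space))


def ids_for_probability(p: float, space: int = SPACE) -> int:
    """Approximate number of IDs at which the collision chance reaches p."""
    return math.ceil(math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p))))


if __name__ == "__main__":
    for p in (0.01, 0.10, 0.50):
        print(f"{p:.0%} chance after about {ids_for_probability(p):,} IDs")
```

This is an approximation that assumes the truncated hashes behave like uniformly random values, which is the same assumption the numbers above make.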

It's important to point out that as long as your hash function does a decent job of being pseudo-random, it doesn't matter which one you use (whether it is SHA1, MD5, or any form of checksum). These numbers assume completely random values, so a real hash function can only approach them, and a better hash function gets closer.

So in the end, it depends on how much traffic you are expecting. If this is a small site, you can probably get away with it. If you expect a large amount of traffic, then your odds of a collision are very high.

Stargazer712
I thought about doing that, but I don't know how to go about writing a program that would detect two identical strings. Any ideas?
Robert
Generate completely random strings and calculate their hashes. Hash functions are (or at least they try to be) pseudo-random, so it should make no difference whether the input makes sense.
Stargazer712
Well, why not show us what "significantly higher" really amounts to?
stillstanding
@stillstanding, I did. Thanks for the meaningless downvote.
Stargazer712
What would I do if I had a site with a lot of traffic? I don't, but I would like to know what big websites like tinyurl do.
Robert
@Robert, like I mentioned, you have 4 billion options. If you find a collision, add 1 to the number until you find one that is unique.
Stargazer712
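The "add 1 until unique" idea from the comment above could look something like the sketch below. It is Python for illustration only; `taken` stands in for whatever lookup the real site would do against its database, and both function names are hypothetical.

```python
SPACE = 16 ** 8  # number of distinct 8-hex-digit IDs


def next_free_id(candidate: int, taken: set, space: int = SPACE) -> int:
    """Step forward from candidate, wrapping around, until an unused value is found."""
    if len(taken) >= space:
        raise RuntimeError("every ID in the space is already in use")
    while candidate in taken:
        candidate = (candidate + 1) % space
    return candidate


def to_public_id(value: int) -> str:
    """Format the number as 8 uppercase hex digits."""
    return format(value, "08X")
```

For example, the first 8 hex digits of the hash can be converted with `int(prefix, 16)`, passed through `next_free_id`, and turned back into a string with `to_public_id`. In a real database the check and the insert would need to happen atomically, for instance by relying on a unique index.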
None of this matters, since the SHA1 hash of the record id can easily be guessed by calculating the SHA1 hash of incremental numbers. Using SHA1 is NOT random.
DGM
@DGM, yes we all know SHA1 is insecure, but he is not looking at using it for security purposes. Of course SHA1 is not random. It (like all hash functions) tries to be pseudo-random. As I mentioned in the post, this is the reason I said that hash functions can only *approach* the numbers I listed off.
Stargazer712
SHA1 security is not the point. The point is, why use it? Just use a random string and a unique index on the column.
DGM
@DGM, the most common reason would be so that the values are reproducible. I don't know if that is a requirement, but that would be a reason.
Stargazer712
The first line of the question says random. SHA1 is not random.
DGM
@DGM, Random or pseudo-random? It doesn't sound like he knows the difference, and it doesn't sound like you recognize the difference. All commonly used random number generators are pseudo-random, and suffer the same limitations as SHA1. Best of luck to you.
Stargazer712
@Stargazer712 I most certainly do. Even pseudo-random is far more random than a hash of an autoincrement id. Especially if it comes from a new seed from the server every time. A standard call to the system rand() would probably come with a good enough result, coupled with a check for uniqueness. As I've been saying, SHA1 is NOT random. It is entirely predictable if you know it is based on an incrementing number. A "random" string, produced even with a pseudo random algorithm, is far, far less predictable, and by definition is random. I'm just trying to answer the original question.
DGM
A: 

Before assigning the id, you could always check that it isn't taken... or even better, put a unique constraint on the database field... problem solved. :)

Wait, you say SHA1 of the id. You don't mean the autoinc id, do you? If so, my first guesses would be:

356a192b
da4b9237
77de68da

If you are using a random id, why run sha1 on it?
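The guesses listed above are just the first 8 hex digits of the SHA-1 of "1", "2" and "3". A small sketch (Python for illustration, with a made-up function name, and assuming the ids are hashed as plain decimal strings with nothing added) shows how little effort it takes to enumerate them:

```python
import hashlib


def hashed_id(record_id: int, length: int = 8) -> str:
    """First `length` hex digits of the SHA-1 of the decimal record id."""
    digest = hashlib.sha1(str(record_id).encode("ascii")).hexdigest()
    return digest[:length]


if __name__ == "__main__":
    # Anyone can walk the auto-increment sequence and list the "secret" URLs.
    for record_id in range(1, 4):
        print(record_id, hashed_id(record_id))
```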

DGM
Yes, the autoinc id in the database. I want the id people actually see to be random so that they can't see other people's posts. Right now it is id=45, and they can just change it to anything from 0 to 45 and see all of those posts. Overall this is just for the knowledge; I don't expect to get more than 200 posts, but I would like it to be as well written as tinyurl would be.
Robert
If you want the url to be random, then you do NOT want a hash. To see your id=45, I'd just enter fb644351. Generate a *real* random string and store it in the record with a unique index, and then search for that when the URL is received.
DGM
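One way the suggestion above could be put together is sketched below: a random string stored alongside the record under a unique index, with a retry when the insert collides. This is only an illustration in Python with an in-memory SQLite database standing in for the site's real one. The table name `pastes`, the column `public_id`, the alphabet and the function names are all assumptions made for this example.

```python
import secrets
import sqlite3

# Assumed alphabet: uppercase letters and digits without look-alikes (I, O, 0, 1).
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"


def open_db() -> sqlite3.Connection:
    """Create an in-memory database whose public_id column must be unique."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pastes ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " public_id TEXT NOT NULL UNIQUE,"
        " body TEXT NOT NULL)"
    )
    return conn


def random_id(length: int = 6) -> str:
    """Build an ID from a cryptographically strong random source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def create_paste(conn: sqlite3.Connection, body: str, max_attempts: int = 10) -> str:
    """Insert a paste, retrying with a fresh ID if the unique index rejects it."""
    for _ in range(max_attempts):
        public_id = random_id()
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO pastes (public_id, body) VALUES (?, ?)",
                    (public_id, body),
                )
            return public_id
        except sqlite3.IntegrityError:
            continue  # ID already taken, try another one
    raise RuntimeError("could not find a free ID")


def fetch_paste(conn: sqlite3.Connection, public_id: str):
    """Look a paste up by the ID that appears in the URL."""
    row = conn.execute(
        "SELECT body FROM pastes WHERE public_id = ?", (public_id,)
    ).fetchone()
    return row[0] if row else None
```

Because the database enforces uniqueness, there is no window between checking for an existing ID and inserting the new row, and the auto-increment id never has to appear in a URL.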
A: 

I figured it out. My code is:

$new_id = strtoupper(substr(sha1($token_start . $id . $token_end), 0, 8));

where $id is obtained by counting the total number of ids in the database and adding 1, which gives the next id since the column is auto-increment.

Then, when it inserts the entry, it inserts the hashed id.

$token_start and $token_end are both secret random strings of your choosing, so that the new id cannot be reproduced by simply hashing the number.

I made a loop which inserted 32,000 rows into a database, containing just the auto-increment id along with the new id. I then did a search with DISTINCT and didn't get any duplicates. That's more than enough for me. Any comments would be helpful. I don't know how long it would take until it gave me a collision. If anybody knows when the first one would occur, that would be awesome.
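The duplicate check described above can be reproduced without a database. This is a rough Python equivalent of the PHP expression, written only for illustration; the function names and the token values in the usage line are placeholders, not the ones actually used.

```python
import hashlib
from collections import Counter


def public_id(record_id: int, token_start: str, token_end: str) -> str:
    """Uppercase first 8 hex digits of sha1(token_start . id . token_end)."""
    text = f"{token_start}{record_id}{token_end}"
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:8].upper()


def count_duplicates(n: int, token_start: str, token_end: str) -> int:
    """Count how many of the ids 1..n repeat an earlier 8-character value."""
    counts = Counter(public_id(i, token_start, token_end) for i in range(1, n + 1))
    return sum(c - 1 for c in counts.values() if c > 1)


if __name__ == "__main__":
    print(count_duplicates(32_000, "placeholder-start", "placeholder-end"))
```

Seeing zero duplicates in one run of 32,000 is the most likely outcome, but a different pair of tokens can give a different result, so one clean run does not show that collisions cannot happen.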

Robert
As I mentioned, at the 30k mark, you had about a 10% chance of collision. You cannot guarantee when a collision will occur, because it is based on chance. At 77k, you will have a 50-50 chance.
Stargazer712