tags:

views:

48

answers:

2

I have a randomly generated link id which is 8 digits long with 50 values per digit, which means 4x10^13 possible combinations (i think). I have about ten-thousands of queries per day.

My question is, should I check 4 tables each query for duplicates, or skip it? or make it 10 digits so that it definitely won't be a match?

edit:

my (probably copied) generator

// START Generates Random String
function genRandString($len=8){
$base='ABCDEFGHKLMNPQRSTWXYZabcdefghjkmnpqrstwxyz23456789';
$max=strlen($base)-1;
$randstring1 ='';
mt_srand((double)microtime()*1000000);
while (strlen($randstring1)<$len+1)
$randstring1.=$base{mt_rand(0,$max)};
return $randstring1;
}
// END Generates Random String
+2  A: 

It depends on the quality of the pseudorandom number generator. You might have insufficient entropy so you're more likely to get a collision that you realize.

Is there a reason you aren't using UUID()? It seems like the best solution that is designed for this purpose.

Anyway, I don't recommend checking for duplicates before you insert. This is subject to race conditions, that is, someone could insert the duplicate value in the moment after you check, but before you insert. So you'd have to handle duplicate key violation exceptions anyway. Better to just try the insert (without checking first) and handle exceptions as needed.


Re your comments and your algorithm: I would not use that hashing scheme. You have fewer than 24 bits of information in four digits of 50 distinct values. So your chance of collision is pretty significant once you have a few thousand rows in your database.

How about this solution: Use a monotonically increasing primary key value, e.g AUTO_INCREMENT. To change this number into an alphanumeric string, use base_convert():

$id = 12345678;
$str = base_convert($id, 10, 36);
echo "$str\n";

The result is 7clzi.

If you are worried about confusion in letters like 1, l, i, 0, o, you can do some custom substitutions:

$from = array('1', 'l', 'i', '0', 'o');
$to   = array('A', 'B', 'C', 'D', 'E');
$str = str_replace($from, $to, $str);

Now the value 12345678 is converted to 7cBzC. When someone requests a page by this code, do the conversion in reverse:

$code = str_replace($to, $from, $code);
$id = base_convert($code, 36, 10); 
Bill Karwin
I added my generator above if you wouldn't mind checking. The reason I'm not using UUID is that the ids are used at the end of links mypage.com/?q=RADF23fj09. No race condition, there would be like literally 1 millisecond for a 1 in a trillion duplicate to occur.
Derek
Great solution. My mind is blown by base_convert (http://en.wikipedia.org/wiki/Base_36 helps). So this is what TinyURL does... Thanks!
Derek
One more thing Mr. Karwin, my ids are currently 123 by default, is it necessary to have the id as 000000123 in the database?
Derek
No, the base conversion is by integer value, not by string. So 123 would a shorter code like: `3f`. As your integer values get larger, the code gets longer.
Bill Karwin
A: 

prefix it with a timestamp and skip checking it

Crayon Violent