What I am trying to do is create a 12-character ID for articles on my website, similar to how YouTube handles its video IDs (http://www.youtube.com/watch?v=53iddd5IcSU). Right now I am generating an MD5 hash and then taking 12 characters of it, like this:

$ArticleId = substr(md5("Article" . $currentID), 10, 12);

where $currentID is the numeric ID from the database (e.g. 144).

I am slightly paranoid that I will run into a duplicate $ArticleId, but realistically, what are the chances that this will happen? Also, since the column in my database is unique, how can I handle this rare scenario without an ugly error being thrown?

P.S. I wrote a small script to check for duplicates among the first 5000 $ArticleId values, and there were none.

EDIT: I don't like the way the base64_encode hashes look so I did this:

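// Retry with a different input when the previous candidate collides.
// $attempt changes on every retry so the recursion cannot loop on the same value.
function retryAID($currentID, $attempt = 2)
{
    $AID = substr(md5("Article" . ($currentID * $attempt)), 10, 12);

    $setAID = "UPDATE `table` SET `artID` = '$AID' WHERE `id` = $currentID";
    mysql_query($setAID) or retryAID($currentID, $attempt + 1);
}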


$AID = substr(MD5("Article".$currentID),10,12);

$setAID = "UPDATE `table` SET  `artID` =  '$AID' WHERE `id` = $currentID ";
mysql_query($setAID) or retryAID($currentID);

Since the artID column is unique, mysql_query will fail (return false) on a duplicate, and the retryAID function will then look for a unique id...
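One way to keep the "retry on duplicate" idea while bounding the number of attempts is sketched below. It is only an illustration under stated assumptions: the names candidate_aid, assign_aid and make_pdo_store are hypothetical, the table and column names are copied from the snippet above, and PDO is swapped in for mysql_query because the mysql_* functions were removed in PHP 7. The PDO wrapper assumes the connection uses PDO::ERRMODE_EXCEPTION.

```php
<?php
// Hypothetical helper: candidate id for a row; attempt 0 matches the original formula.
function candidate_aid(int $currentID, int $attempt): string
{
    $salt = $attempt === 0 ? '' : ':' . $attempt;
    return substr(md5('Article' . $currentID . $salt), 10, 12);
}

// Tries up to $maxAttempts candidates. $store must return true when the id
// was saved and false when the unique column rejected it.
function assign_aid(int $currentID, callable $store, int $maxAttempts = 5): ?string
{
    for ($attempt = 0; $attempt < $maxAttempts; ++$attempt) {
        $aid = candidate_aid($currentID, $attempt);
        if ($store($aid, $currentID)) {
            return $aid;
        }
    }
    return null; // give up; the caller decides how to report this
}

// One possible $store, assuming a PDO connection in ERRMODE_EXCEPTION mode.
function make_pdo_store(PDO $pdo): callable
{
    return function (string $aid, int $currentID) use ($pdo): bool {
        try {
            $stmt = $pdo->prepare('UPDATE `table` SET `artID` = ? WHERE `id` = ?');
            return $stmt->execute([$aid, $currentID]);
        } catch (PDOException $e) {
            // SQLSTATE class 23000 covers integrity violations such as duplicate keys.
            if ((string) $e->getCode() === '23000') {
                return false;
            }
            throw $e; // anything else is a real error
        }
    };
}
```

With a real connection this might be called as assign_aid($currentID, make_pdo_store($pdo)); a null result means every candidate collided and can be reported however the site prefers, instead of surfacing a raw database error.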

A: 

No, not very unique.

Why not base64 encode it if you need it shorter?

Louis
I think he wants to obfuscate it.
RageZ
+6  A: 

What's wrong with using a sequential id? The database will handle this for you.

That aside, 12 characters is still 96 bits. 2^96 = 79228162514264337593543950336 possible hashes. Even though MD5 is known to have collision vulnerabilities, there's a world of difference between the possibility of a collision and the probability of actually seeing one.

Update:

Based on the return value of the PHP md5 function you're using, my numbers above aren't quite right.

Returns the hash as a 32-character hexadecimal number.

Since you're taking 12 characters from a 32-character hexadecimal number (and not 12 bytes of the 128-bit hash), the actual number of possible hashes you could end up with is 16^12 = 281474976710656. Still quite a few.
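As a rough check on the "what are the chances" question, the birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)) estimates the probability of at least one duplicate among n ids drawn from N possible values. The sketch below assumes the 12 hex characters behave like uniformly random values, which is an assumption and not something MD5 guarantees for sequential inputs; the function name collision_probability is made up for illustration.

```php
<?php
// Birthday approximation: chance of at least one duplicate among $n ids
// drawn uniformly at random from $space possible values.
// Assumption: the 12-character md5 substring behaves like a uniform random value.
function collision_probability(float $n, float $space): float
{
    // -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -expm1(-$n * ($n - 1) / (2 * $space));
}

$space = pow(16, 12); // 281474976710656 possible 12-character hex ids

printf("5,000 articles:     %.3e\n", collision_probability(5000, $space));
printf("1,000,000 articles: %.3e\n", collision_probability(1000000, $space));
```

Under that assumption, 5,000 ids give a probability of roughly 4 in 100 million, and the chance only reaches about one half at roughly 20 million ids.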

Bill the Lizard
simplest answer is almost always the best!
Mitch Wheat
The md5 is only base-16, so there really are "only" 16^12 possible values (281,474,976,710,656). The likelihood of collision would depend on how variable those 12 sequential values of the md5 hash are. (Edit: n/m, I guess you already pointed that out!)
konforce
MD5 collision vulnerabilities are not relevant to this application.
GregS
A: 

How about a UUID?

http://php.net/manual/en/function.uniqid.php

dschulz
+1  A: 
<?php
  function get_id()
  {
    $max = 1679615; // pow(36, 4) - 1;
    $id = '';

    for ($i = 0; $i < 3; ++$i)
    {
      $r = mt_rand(0, $max);
      $id .= str_pad(base_convert($r, 10, 36), 4, "0", STR_PAD_LEFT);
    }
    return $id;
  }
?>

Returns a 12 character number in base-36, which gives 4,738,381,338,321,616,896 possibilities. (The probability of collision depends on the distribution of the random number generator.)
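Because mt_rand is not a cryptographically secure generator, a variant built on random_int (available since PHP 7) may be preferable when the ids should also be hard to guess. This is a sketch of that swapped-in technique, not part of the function above; the name get_id_csprng is hypothetical.

```php
<?php
// Variant of get_id() that draws each base-36 digit from random_int(),
// which uses the operating system's cryptographically secure source.
function get_id_csprng(int $length = 12): string
{
    $alphabet = '0123456789abcdefghijklmnopqrstuvwxyz'; // 36 symbols
    $id = '';
    for ($i = 0; $i < $length; ++$i) {
        // random_int() is inclusive on both ends, so 0..35 covers the alphabet.
        $id .= $alphabet[random_int(0, 35)];
    }
    return $id;
}

echo get_id_csprng(), "\n";
```

The id space is the same 36^12 values, so a uniqueness check against the database is still needed.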

To ensure no collisions, you'll need to loop:

<?php
do {
  $id = get_id();
} while ( !update_id($id) );
?>
konforce
Can you explain your intentions with the use of the str_pad function? It doesn't appear to do anything. I'm guessing it's to ensure the base_convert result is definitely 4 characters? Or possibly to typecast to string?
Atomix
The padding is there to make sure each of the three pieces is exactly four characters long. e.g., `base_convert(0, 10, 36)` would yield `0`, but with the padding it would be `0000`.
konforce
You would expect a collision after roughly 36^6 = 2,176,782,336 calls to get_id(), i.e. about the square root of the number of possible ids. That's a big number, but I'd still go with your ensure-no-collisions loop.
GregS