tags:

views:

698

answers:

11

Hello all,

I have just found this great tutorial as it is something that I need.

However, after having a look, it seems that this might be inefficient. The way it works is, first generate a unique key then check if it exists in the database to make sure it really is unique. However, the larger the database gets the slower the function gets, right?

Instead, I was thinking, is there a way to add ordering to this function? So all that has to be done is check the previous entry in the DB and increment the key. So it will always be unique?

function generate_chars()

{

    $num_chars = 4; //max length of random chars
    $i = 0;
    $my_keys = "123456789abcdefghijklmnopqrstuvwxyz"; //keys to be chosen from
    $keys_length = strlen($my_keys);
    $url  = "";
    while($i<$num_chars)
    {
     $rand_num = mt_rand(1, $keys_length-1);
     $url .= $my_keys[$rand_num];
     $i++;
    }
    return $url;
}

function isUnique($chars)

{
    //check the uniqueness of the chars
    global $link;
    $q = "SELECT * FROM `urls` WHERE `unique_chars`='".$chars."'";
    $r = mysql_query($q, $link);
    //echo mysql_num_rows($r); die();
    if( mysql_num_rows($r)>0 ): 
     return false;
    else: 
     return true;
    endif;
}
A: 

That might work, but the easiest way to accomplish the problem would probably be with hashing. Theoretically speaking, hashing runs in O(1) time, as in, it only has to perform the hash, and then does only one actual hit to the database to retrieve the value. Then, you would introduce complications for checking for hash collisions, but it seems like this is probably what most of the tinyurl providers do. And, a good hash function isn't terribly hard to write.

Peter
N(1)? Did I miss something in the Data Structures class at university? Is that like O(1) or Omega(1) or Theta(1)?
DrJokepu
I mean O(1). Sorry!
Peter
It is hard for me to write! Do you know of any references that has already written this sort of function that I can make use of!
Abs
PHP has hashing library: http://php.net/manual/en/book.hash.php
Peter
Ah wait, a Hash, that won't be short and therefore will not work for a tiny url service, right?
Abs
A: 

Use autoincrement on the database, and get the latest id as described by http://www.acuras.co.uk/articles/24-php-use-mysqlinsertid-to-get-the-last-entered-auto-increment-value

David Dorward
See the reply I made to balpha. This method will not be a tiny url any more once large records are reached in the database.
Abs
+1  A: 

You could of course add ordering by simply numbering the urls:

http://mytinyfier.com/1
http://mytinyfier.com/2

and so on. But if the hash key is indexed in the database (which it obviously should be), the performance boost would be minimal at best.

balpha
If I have 1000000 records then the next record would have a URL of http://mytinyfier.com/1000001 - and that isn't really tiny. I am not sure if this is what you meant?
Abs
You don't have to just use the digits 0-9 for the representation in the URL.
balpha
+3  A: 

Couldn't you encode the URL as Base36 when it's generated, and then decode it when visited - that would allow you to remove the database completely?

A snippet from Channel9:

The formula is simple, just turn the Entry ID of our post, which is a long into a short string by Base-36 encoding it and then stick 'http://ch9.ms/' onto the front of it. This produces reasonably short URLs, and can be computed at either end without any need for a database look up. The result, a URL like http://ch9.ms/A49H is then used in creating the twitter link.

EvilChookie
You must define the base36 number system. Can you please explain how do you encode a URL intto base36 with an example ?
this. __curious_geek
I too think it would be great to have an example.
Abs
I didn't come up with the solution, it was simply something I saw in passing while visiting Channel9 this morning. As such, my answer might be slightly technically inaccurate.
EvilChookie
-1: they're only suggesting you Base36 encode the Entry ID of a given post, not the entire URL. This isn't a general URL shortening system, it's just a way to get shorter URLs for a single website, namely http://ch9.ms
Dennis Palmer
+8  A: 

In the database table, there is an index on the unique_chars field, so I don't see why that would be slow or inefficient.

UNIQUE KEY `unique_chars` (`unique_chars`)

Don't rush to do premature optimization on something that you think might be slow.

Also, there may be some benefit in a url shortening service that generates random urls instead of sequential urls.

Dennis Palmer
A: 

I wouldn't bother doing ordered enumeration for two reasons:

1) SQL servers are very effective at checking such hash collisions (given correct indexes)

2) That might hurt privacy, as users would be able to easily figure out what other users are tinyurl-ing.

+7  A: 

I don't know why you'd bother. The premise of the tutorial is to create a "random" URL. If the random space is large enough, then you can simply rely on pure, dumb luck. If you random character space is 62 characters (A-Za-z0-9), the the 4 characters they use, given a reasonable random number generator, is 1 in 62^4, which is 1 in 14,776,336. Five characters is 1 in 916,132,832. So, a conflict is, literally, "1 in a billion".

Obviously, as the documents fill, your odds increase for the chance of a collision.

With 10,000 documents, it's 1 in 91,613, almost 1 in 100,000 (for round numbers).

That means, for every new document, you have a 1 in 91,613 chance of hitting the DB again for another pull on the slot machine.

It is not deterministic. It's random. It's luck. In theory, you can hit a string of really, really, bad luck and just get collision after collision after collision. Also, it WILL, eventually, fill up. How many URLs do you plan on hashing?

But if 1 in 91,613 odds isn't good enough, boosting it to 6 chars makes it more than 1 in 5M for 10,000 documents. We're talking almost LOTTO odds here.

Simply put, make the key big enough (7 characters? 8?) and the problem pretty much "wishes" itself out of existence.

Will Hartung
No matter the size of the key and the probability of a collision, you still need to check for the collision. Doing so on a uniquely indexed DB field is probably the best way to go.
Dennis Palmer
+8  A: 

The tiny url people like to use random tokens because then you can't just troll the tiny url links. "Where does #2 go?" "Oh, cool!" "Where does #3 go?" "Even cooler!" You can type in random characters but it's unlikely you'll hit a valid value.

Since the key is rather sparse (4 values each having 36* possibilities gives you 1,679,616 unique values, 5 gives you 60,466,176) the chance of collisions is small (indeed, it's a desired part of the design) and a good SQL index will make the lookup be trivial (indeed, it's the primary lookup for the url so they optimize around it).

If you really want to avoid the lookup and just unse auto-increment you can create a function that turns an integer into a string of seemingly-random characters with the ability to convert back. So "1" becomes "54jcdn" and "2" becomes "pqmw21". Similar to Base64-encoding, but not using consecutive characters.

(*) I actually like using less than 36 characters -- single-cased, no vowels, and no similar characters (1, l, I). This prevents accidental swear words and also makes it easier for someone to speak the value to someone else. I even map similar charactes to each other, accepting "0" for "O". If you're entirely machine-based you could use upper and lower case and all digits for even greater possibilities.

Talljoe
+2  A: 

I solved a similar problem by implementing an alogirthm that used to generate serial numbers one-by-one in base36. I had my own oredring of base36 characters all of which are unique. Since it was generating numbers serially I did not have to worry about duplication. Complexity and randomness of the number depends on the ordering of base36 numbers[characters]... that too for public only becuase to my application they are serial numbers :)

this. __curious_geek
How many characters were the keys made up from?
Abs
36 characters. 0-9,a-z. a decimal 10 will be a when converted base64 if your base64 number system is [0123456789abcde...z]. You can change the order provided you do not repeat a character. In the order of mentioned base36 number system, the decimal 1000 will be represented as 'qr'. I assume you know how to convert a decimal number into any other base. It is done by division method, dividing the decimal number by the base of target number system.
this. __curious_geek
I actually used base35 to eliminate the confusion between 0 and o.I removed o.
this. __curious_geek
A: 

Perhaps this is a bit off-answer, but, my general rule for creating always unique keys is simple md5( time() * 100 + rand( 0, 100 ) ); There is a one in 100,000 chance that if two people are using the same service at the same second they will get the same result (nie impossible).

That said, md5( rand( 0, n ) ) works too.

Christopher W. Allen-Poole
The main point is for the URL to be short.
Abs
Technically, it's actually a one in 100 chance that two people submitting at the exact same time would be a perfect match: Choose a random variable for the first person; the second person has a 1 in 100 chance of getting the same random number.This also doesn't address the (relatively small) probability of hash collisions in MD5... but according to http://en.wikipedia.org/wiki/Birthday_attack , the chance of having a single collision with 10^18 documents is <1%. Assuming alot of things.
CoderTao
That is only true if somehow they managed to submit at the exact same Unix timestamp. If they are submitting in the same second (which is my exact wording, not time), they have a 1/1000 chance of hitting the same millisecond. In addition to that, they will only have a 1/100 chance of getting the same value. This leaves a final probability of < 1/1000 of 1% per second.
Christopher W. Allen-Poole
A: 

Check out this guys functions - http://www.pgregg.com/projects/php/base_conversion/base_conversion.php source - http://www.pgregg.com/projects/php/base_conversion/base_conversion.inc.phps

You can use any base you like, for example to convert 554512 to base 62, call

$tiny = base_base2base(554512, 10, 62); and that evaluates to $tiny = '2KFk'.

So, just pass in the unique id of the database record.

In a project I used this in a removed a few characters from the $sChars string, and am using base 58. You can also rearrange the characters in the string if you want the values to be less easy to guess.

postpostmodern