tags:

views:

1368

answers:

10

I've always wondered how and why they do this...an example: http://youtube.com/watch?v=DnAMjq0haic

How are these IDs generated such that there are no duplicates, and what advantage does this have over having a simple auto incrementing numeric ID?

How do one keep it short but still keep it's uniqueness? The string uniqid creates are pretty long.

+2  A: 

Auto-incrementing can easily be crawled. These cannot be predicted, and therefore cannot be sequentially crawled.

I suggest going with a double-url format (Similar to the SO URLs):

yoursite.com/video_idkey/url_friendly_video_title

If you required both the id, and the title in the url, you could then use simple numbers like 0001, 0002, 0003, etc.

Generating these keys can be really simple. You could use the uniqid() function in PHP to generate 13 chars, or 23 with more entropy.

Jonathan Sampson
That's just the thing though... SO's URL scheme doesn't require the title at all, look: http://stackoverflow.com/questions/1076110/
John T
If crawling is your concern, you can force it to require both parts.
Mark
I know - you don't have to require it either. But if you want to keep them from crawling 1-1,000,000, require the title too.
Jonathan Sampson
+2  A: 

A way to do it is by a hash function with unique input every time.

example (you've tagged the question with php therfore):

$uniqueID = null
do {
  $uniqueID = sha1( $fileName + date() );
} while ( !isUnique($uniqueID) )
dimitris mistriotis
Not *guaranteed* unique :) Very unlikely to have a collision though. A complete solution would at least check for collisions, and regenerate a new ID if there is one though.
Mark
you are completly correct I had it implemented for an application recently, and forgot to write it here!Let's refactor the anwser...
dimitris mistriotis
Very unlikely in this case being that it takes an intractable amount of computation to find two that collide at all. Even assuming 2^63 hashes or so would be required to get a 50% chance of collision by granting that cryptanalytic attacks model trivial redundancy in the hash function. The numbers are huge: 9.22337204 × 10^18, so if you generate ~ 4 billion keys with this you still only have a 1 in 4 billion chance of having a collision between them. And if you don't grant that the cryptographic vulnerabilities directly increase collision chance then those odds are much more in your favor.
Edward Kmett
@Edward: Still worth the extra two lines of code :)@dimitris: you mean while !unique, don't you?
Mark
@Mark: of course
dimitris mistriotis
@Mark: right up until you are generating this in a distributed system and only have snapshot consistency guarantees because you can't keep up with the load, I agree. ;)
Edward Kmett
+1  A: 

There should be a library for PHP to generate these IDs. If not, it's not difficult to implement it.

The advantage is that later you won't have name conflicts, when you try to reorganize or merge different server resources. With numeric ids you would have to change some of them to resolve conflicts and that will result in Url change leading to SEO hit.

User
-1 for being too lazy to google "php guid" and find the appropriate function, +2 for mentioning server conflicts. I'm sure google uses some sort of server cloud, could be a big reason why they use IDs like that on youtube.
Mark
Mark: actually I wouldn't necessarily recommend GUIDs. They are designed to be collision free, but not to be hard to predict, so to meet all the desired characteristics of these identifiers you may want something stronger.
Edward Kmett
I thought Mastermind was referring to GUIDs in his first sentence. Didn't mean that they are a good solution :D
Mark
+1  A: 

So much of this depends on what you need to do. How 'unique' is unique? Are you serving up the unique ID's, and do they mean something in your DB? if so, a sequential # might be ok.

ON the other hand, if you use sequential #'s someone could systematically steal your content by iterating thru the numbers.

There are filesystem commands that will generate unique file names - you could use those.

Or GUID's.

n8wrl
+1 for guids. They're awesome
Michael Haren
GUIDs work well if you don't require that they be unpredictable. But since a large part of their non-collision guarantee comes from well known portions of them coming from MAC addresses, etc. you explicitly can't rely on them not having a guessable progression.
Edward Kmett
@Edward Kmett: UUIDs (of which GUIDs are a specific subset's implementation) can use several different algorithms, only one of them use MAC addresses or other well-known sources. there's even one where almost all bits are from random sources.
Javier
Try version 4 UUIDs which are randomly generated. http://en.wikipedia.org/wiki/Universally_Unique_Identifier#Version_4_.28random.29
Craig McQueen
Ah good point. 122 bits of random goodness. I can't complain if you're sure what you're using to get UUIDs are generating randomly generated ones, which I just realized that Microsoft's NewGuid() does. Good to know. Hrmm. Interestingly the NEWSEQUENTIALID() call in SQL Server doesn't correspond to any of the defined UUID or GUID formats (it seems to fall under "reserved for further expansion" in the GUID spec per Wikipedia) Probably because its weaker than any of them. ;)
Edward Kmett
+9  A: 

Try this: http://php.net/manual/en/function.uniqid.php

A: 

I don't have a formula but we do this on a project that I'm on. (I can't share it). But we basically generate one character at a time and append the string.

Once we have a completed string, we check it against the database. If there is no other, we go with it. If it is a duplicate, we start the process over. Not very complicated.

The advantage is, I guess that of a GUID.

Frank V
+1  A: 

Consider using something like:

$id = base64_encode(md5(uniqid(),true));

uniqid will get you a unique identifier. MD5 will diffuse it giving you a 128 bit result. Base 64 encoding that will give you 6 bits per character in an identifier suitable for use on the web, weighing in around 23 characters and computationally intractable to guess. If you want to be even more paranoid ugrade from md5 to sha1 or higher.

Edward Kmett
+1  A: 

Results of hash functions like SHA-1 or MD5 and GUIDs tend to become very long, which is probably something you don't want. (You've specifically mentioned YouTube as an example: Their identifiers stay relatively short even with the bazillion videos they are hosting.)

This is why you might want to look into converting your numeric IDs, which you are using behind the scenes, into another base when putting them into URLs. Flickr e.g. uses Base58 for their canonical short URLs. Details about this are available here: http://www.flickr.com/groups/api/discuss/72157616713786392/. If you are looking for a generic solution, have a look at the PEAR package Mathe_Basex.

Please note that even in another base, the IDs can still be predicted from outside of your application.

Martin Jansen
+1  A: 

If you want short URLs and predictability is not a concern, you can convert the auto-incrementing ID to a higher base.

Mark
+2  A: 

base62 or base64 encode your primary key's value then store it in another field.

example base62 for primary key 12443 = 3eH

saves some space, which is why im sure youtube is using it.

doing a base62(A-Za-z0-9) encode on your PK or unique identifier will prevent the overhead of having to check to see if the key already exists :)

Jason