ansaurus

Question

How do I create unique IDs, like YouTube?

Answer 1

+2 A:

Auto-incrementing can easily be crawled. These cannot be predicted, and therefore cannot be sequentially crawled.

I suggest going with a double-url format (Similar to the SO URLs):

yoursite.com/video_idkey/url_friendly_video_title

If you required both the id, and the title in the url, you could then use simple numbers like 0001, 0002, 0003, etc.

Generating these keys can be really simple. You could use the uniqid() function in PHP to generate 13 chars, or 23 with more entropy.

Jonathan Sampson 2009-07-02 19:03:18

That's just the thing though... SO's URL scheme doesn't require the title at all, look: http://stackoverflow.com/questions/1076110/

John T 2009-07-02 19:11:13

If crawling is your concern, you can force it to require both parts.

Mark 2009-07-02 19:12:53

I know - you don't have to require it either. But if you want to keep them from crawling 1-1,000,000, require the title too.

Jonathan Sampson 2009-07-02 19:55:04

Answer 2

+2 A:

A way to do it is by a hash function with unique input every time.

example (you've tagged the question with php therfore):

$uniqueID = null
do {
  $uniqueID = sha1( $fileName + date() );
} while ( !isUnique($uniqueID) )

dimitris mistriotis 2009-07-02 19:03:37

Not *guaranteed* unique :) Very unlikely to have a collision though. A complete solution would at least check for collisions, and regenerate a new ID if there is one though.

Mark 2009-07-02 19:14:14

you are completly correct I had it implemented for an application recently, and forgot to write it here!Let's refactor the anwser...

dimitris mistriotis 2009-07-02 19:28:24

Very unlikely in this case being that it takes an intractable amount of computation to find two that collide at all. Even assuming 2^63 hashes or so would be required to get a 50% chance of collision by granting that cryptanalytic attacks model trivial redundancy in the hash function. The numbers are huge: 9.22337204 × 10^18, so if you generate ~ 4 billion keys with this you still only have a 1 in 4 billion chance of having a collision between them. And if you don't grant that the cryptographic vulnerabilities directly increase collision chance then those odds are much more in your favor.

Edward Kmett 2009-07-02 19:51:17

@Edward: Still worth the extra two lines of code :)@dimitris: you mean while !unique, don't you?

Mark 2009-07-02 22:14:08

@Mark: of course

dimitris mistriotis 2009-07-02 23:45:15

@Mark: right up until you are generating this in a distributed system and only have snapshot consistency guarantees because you can't keep up with the load, I agree. ;)

Edward Kmett 2009-07-03 00:46:53

Answer 3

+1 A:

There should be a library for PHP to generate these IDs. If not, it's not difficult to implement it.

The advantage is that later you won't have name conflicts, when you try to reorganize or merge different server resources. With numeric ids you would have to change some of them to resolve conflicts and that will result in Url change leading to SEO hit.

User 2009-07-02 19:03:43

-1 for being too lazy to google "php guid" and find the appropriate function, +2 for mentioning server conflicts. I'm sure google uses some sort of server cloud, could be a big reason why they use IDs like that on youtube.

Mark 2009-07-02 19:16:52

Mark: actually I wouldn't necessarily recommend GUIDs. They are designed to be collision free, but not to be hard to predict, so to meet all the desired characteristics of these identifiers you may want something stronger.

Edward Kmett 2009-07-02 19:39:14

I thought Mastermind was referring to GUIDs in his first sentence. Didn't mean that they are a good solution :D

Mark 2009-07-02 22:16:07

Answer 4

+1 A:

So much of this depends on what you need to do. How 'unique' is unique? Are you serving up the unique ID's, and do they mean something in your DB? if so, a sequential # might be ok.

ON the other hand, if you use sequential #'s someone could systematically steal your content by iterating thru the numbers.

There are filesystem commands that will generate unique file names - you could use those.

Or GUID's.

n8wrl 2009-07-02 19:03:55

+1 for guids. They're awesome

Michael Haren 2009-07-02 19:05:04

GUIDs work well if you don't require that they be unpredictable. But since a large part of their non-collision guarantee comes from well known portions of them coming from MAC addresses, etc. you explicitly can't rely on them not having a guessable progression.

Edward Kmett 2009-07-02 19:41:35

@Edward Kmett: UUIDs (of which GUIDs are a specific subset's implementation) can use several different algorithms, only one of them use MAC addresses or other well-known sources. there's even one where almost all bits are from random sources.

Javier 2009-07-02 19:53:54

Try version 4 UUIDs which are randomly generated. http://en.wikipedia.org/wiki/Universally_Unique_Identifier#Version_4_.28random.29

Craig McQueen 2009-07-02 23:54:54

Ah good point. 122 bits of random goodness. I can't complain if you're sure what you're using to get UUIDs are generating randomly generated ones, which I just realized that Microsoft's NewGuid() does. Good to know. Hrmm. Interestingly the NEWSEQUENTIALID() call in SQL Server doesn't correspond to any of the defined UUID or GUID formats (it seems to fall under "reserved for further expansion" in the GUID spec per Wikipedia) Probably because its weaker than any of them. ;)

Edward Kmett 2009-07-03 01:05:21

Answer 5

+9 A:

Try this: http://php.net/manual/en/function.uniqid.php

2009-07-02 19:03:57

Answer 6

A:

I don't have a formula but we do this on a project that I'm on. (I can't share it). But we basically generate one character at a time and append the string.

Once we have a completed string, we check it against the database. If there is no other, we go with it. If it is a duplicate, we start the process over. Not very complicated.

The advantage is, I guess that of a GUID.

Frank V 2009-07-02 19:04:30

Answer 7

+1 A:

Consider using something like:

$id = base64_encode(md5(uniqid(),true));

uniqid will get you a unique identifier. MD5 will diffuse it giving you a 128 bit result. Base 64 encoding that will give you 6 bits per character in an identifier suitable for use on the web, weighing in around 23 characters and computationally intractable to guess. If you want to be even more paranoid ugrade from md5 to sha1 or higher.

Edward Kmett 2009-07-02 19:56:27

Answer 8

+1 A:

Results of hash functions like SHA-1 or MD5 and GUIDs tend to become very long, which is probably something you don't want. (You've specifically mentioned YouTube as an example: Their identifiers stay relatively short even with the bazillion videos they are hosting.)

This is why you might want to look into converting your numeric IDs, which you are using behind the scenes, into another base when putting them into URLs. Flickr e.g. uses Base58 for their canonical short URLs. Details about this are available here: http://www.flickr.com/groups/api/discuss/72157616713786392/. If you are looking for a generic solution, have a look at the PEAR package Mathe_Basex.

Please note that even in another base, the IDs can still be predicted from outside of your application.

Martin Jansen 2009-07-02 20:12:46

Answer 9

+1 A:

If you want short URLs and predictability is not a concern, you can convert the auto-incrementing ID to a higher base.

Mark 2009-07-02 22:21:37

Answer 10

+2 A:

base62 or base64 encode your primary key's value then store it in another field.

example base62 for primary key 12443 = 3eH

saves some space, which is why im sure youtube is using it.

doing a base62(A-Za-z0-9) encode on your PK or unique identifier will prevent the overhead of having to check to see if the key already exists :)

Jason 2009-07-03 07:34:51

ansaurus

tags:

views:

answers:

How do I create unique IDs, like YouTube?

related questions