views:

942

answers:

8

I would like to generate a short, unique ID without having to check for collisions.

I currently do something like this, but the ID I currently generate is random and checking for collisions in a loop is annoying and will get expensive if the number of records grows significantly.

Normally worrying about collisions isn't an issue, but the unique ID I want to generate is a short unique string 5-8 characters, alpha-numeric, like tinyurl does.

EDIT: I would like to start out with 5 characters and if I hit 60 million entries, then go to 6.. so on and so forth.

To this end, I was thinking I could use an auto_increment value that is hidden from the users, and present them instead with an MD5 or some other method to generate a unique string from that.

Generated strings should not appear to be linear, so simply converting the auto_incremented ID into base 36 [0-9A-Z] is a bit too simplistic, but a function something like that is where I'm going with this.

EDIT: Security is not an issue as this will not be used to secure information. It is simply a shortcut to a longer string. Thank you.

Thank you for your suggestions and sorry for the delay. Dentist..

A: 

I think this will never be really secure, as you only need to find the encryption method behind the short unique string to hijack an ID. Is checking for collisions in a loop really that problematic in your setting?

Pekka
Not right now, but at at 50 million links and 5 characters, something like 80% become collisions.
Daren Schwenke
A: 

An MD5 of an incrementing number should be fine, but I worry that if you're truncating your MD5 (which is normally 128 bits) down to 5-8 characters, you will almost certainly be damaging it's capability to act as a unique signature...

dicroce
MD5 (or any hash for that matter) can never act as a unique signature. So your trunication concerns are a bit moot.
Kevin Peno
Hash's are used as signatures all the time...
dicroce
I'm not sure how that negates my response. People do dumb stuff all the time, that doesn't make hashes magically become "unique".
Kevin Peno
So, because collisions are even possible then it's useless as a signature?
dicroce
+1  A: 

You could probably generate a MD5 hash of the current datetime/random number and truncate it to the length you need (5-8 characters) and store it as the id field.

If you are using storing this information in a database, you don't need to use a for loop to do the collision check, but you could just do a select statement - something like

SELECT count(1) c FROM Table WHERE id = :id

where :id would be the newly generated id. If c is greater than 0 then you know it already exists.

EDIT

This may may not be the best way to go about it. But I'll give it a shot, so I guess what you need is someway of converting a numbers into a unique short string and that is not in sequence.

I guess as you said, base64 encoding already does the number to short string conversion. To avoid the sequence problem you could have some mapping between your auto-generated id's to some "random" value (unique mapping). Then you can base64 encode this unique value.

You could generate this mapping as follows. Have a temporary table store values from 1 - 10,000,000. Sort it in random order and store it into you Map table.

INSERT INTO MappingTable (mappedId) SELECT values FROM TemporaryTable ORDER BY RAND()

Where MappingTable would have the 2 fields id (your auto-generated id would look up against this) and mappedId (which is what you would generate the base64 encoding for).

As you get closer to 10,000,000 you could rerun the above code again and change the values in the temporary table with 10,000,001-20,000,000 or something like that.

aip.cd.aish
The loop originates to generate another unique id if I fail. I know this reeks of premature optimization, but I'm hoping there is a better way.
Daren Schwenke
A: 

An MD5 of an incrementing number should be fine, but I worry that if you're truncating your MD5 (which is normally 128 bits) down to 5-8 characters, you will almost certainly be damaging it's capability to act as a unique signature...

Completely true. Especially if you reach your 80% collision chance a truncated MD5 will be as good as any random number to guarantee uniqueness by itself, i.e. worthless.

But since you're using a database anyway, why not just use a UNIQUE INDEX ? This way the uniquness check is done (in a much more efficient way than using a loop) by MySQL itself. Just try to do the INSERT with your MD5-generated key, and if it fails, try again...

Wim
Yep, but I'm still left to recompute the random and retry the insert. That's what I do now. I was looking for a way to automatically ensure I'm unique on the first try. The base 36 method would do this on the first try, but the first link would be 00000, second 00001 and so on.
Daren Schwenke
Perhaps an upvote of the comment you quoted is in order? :)
dicroce
Sure, once I have enough reputation... ;)
Wim
+1 for the suggestion of using a SQL unique index, but -1 for continuing to suggest MD5. If going to use a psudorandom number that has no 100% unique value, you may as well use PHP's uniqid as the unique key (if not primary key) or use auto increment mixed with an external option (like base64encode) to "mask" the id.
Kevin Peno
Sure, neither of these can guarantee uniqueness by itself. You really need a permutation on a limit range, see my other answer about that.
Wim
Yes you're right. I shouldn't have really posted this as an answer, but I couldn't comment on the original one. 13 points left to go... ;-)
Wim
I like your other answer, but I was commenting on this one ;) (sorry for the jump down, I edited this). You are on your way to commenting for sure :P
Kevin Peno
+4  A: 

You'll need something that's correct by construction, i.e. a permutation function: this is a function that does a one-to-one, reversible mapping of one integer (your sequential counter) to another. Some examples (any combination of these should also work):

  • inverting some of the bits (f.i. using an XOR, ^ in PHP)
  • swapping the places of bits (($i & 0xc) >> 2 | ($i & 0x3) << 2), or just reversing the order of all bits
  • adding a constant value modulo your maximum range (must be a factor of two, if you're combining this with the ones above)

Example: this function will convert 0, 1, 2, 3, 5, .. into 13, 4, 12, 7, 15, .. for numbers up to 15:

$i=($input+97) & 0xf;
$result=((($i&0x1) << 3) + (($i&0xe) >> 1)) ^ 0x5;

EDIT

An easier way would to use a linear congruential generator (LCG, which is usually used for generating random numbers), which is defined by a formula of the form:

X_n+1 = (a * X_n + c) mod m

For good values of a, c and m, the sequence of X_0, X_1 .. X_m-1 will contain all numbers between 0 and m-1 exactly once. Now you can start from a linearly increasing index, and use the next value in the LCG sequence as your "secret" key.

EDIT2

Implementation: You can design your own LCG parameters, but if you get it wrong it won't cover the full range (and thus have duplicates) so I'll use a published and tried set of parameters here from this paper:

a = 16807, c = 0, m = 2147483647

This gives you a range of 2**31. With pack() you can get the resulting integer as a string, base64_encode() makes it a readable string (of up to 6 significant characters, 6 bits per byte) so this could be your function:

substr(base64_encode(pack("l", (16807 * $index) % 2147483647)), 0, 6)
Wim
I like this idea. Got some code for me? 1-60000000 mapping to a 5 character alpha numeric string?
Daren Schwenke
Sorry I'm not trusting my maths at this time of day so it's a 31-bit one. For a 30-bit one you'd have exactly 5 (significant, i.e. without the padding ='s) characters after base64_encode but I couldn't find a parameter set for that.
Wim
I think this one may work for a 30-bit one:a = 357913942, c = 1, m = 1073741823(If I'm not mistaken, it fulfills the criteria in the Wikipedia article for it to have a full period: m = 2**31-1 = 3**2 * 7 * 11 * 31 * 151 * 331, c = 1 so it's obviously coprime with m, and a-1 = 3 * 7 * 11 * 31 * 151 * 331 which is divisible by all prime factors of m...)
Wim
All I read is math, math, math.. :) I'll generate some test cases, well.. alot of test cases this weekend. Thank you!
Daren Schwenke
+1  A: 

you can use a bitwise XOR to scramble some of the bits:

select thefield ^ 377 from thetable;

+-----+---------+
| a   | a ^ 377 |
+-----+---------+
| 154 |     483 |
| 152 |     481 |
|  69 |     316 |
|  35 |     346 |
|  72 |     305 |
| 139 |     498 |
|  96 |     281 |
|  31 |     358 |
|  11 |     370 |
| 127 |     262 |
+-----+---------+
longneck
Yes, that would seem to work as well.
Daren Schwenke
A: 

If you cannot use an auto increment field, and want an absolutely unique value, use UUID. If you decide to use anything else (besides auto increment), you would be silly to NOT check for collisions.

Kevin Peno
A: 

This blog post has something close to what you are after.

http://kevin.vanzonneveld.net/techblog/article/create_short_ids_with_php_like_youtube_or_tinyurl/

bender