views:

832

answers:

8

I want to design a URL shortener for a particular use case and type of end-user that I have targetted. I have decided that I want the URLs to be stored internally according to an auto-incrementing integer key. However, it is also required that a key be represented to users in the url as six-digit base 26 (a-z * 6) AND it's not possible to predict what the base 26 url key is based on the incrementing integer key. In other words, the first url key should not be aaaaaa then the next time someone creates a url it shouldn't be aaaaab, etc, and no loop generating a random one and fishing into the database to see if it already exists repeatedly.

The second part of the requirements (urls in base 26 difficult for an outsider to predict) is the more interesting part. Ideally I would like some kind of algorithmic 1-1 mapping of all of the numbers in the 26^6 range to another number in the same range that I can just then print in base 26, and that I can undo algorithmically and don't need to store in a separate table when I want to look up the url. How can I accomplish this?

A: 

It depends what you mean by unpredictable. If you want cryptographically secure, you might be interested in the Blum Blum Shub algorithm, but you probably don't.

I have implemented a linear feedback shift register for the purpose of giving random looking unique identifiers. LFSRs are simple to implement and they cycle through all possible combinations, although it is possible to calculate the next number given the previous number (it isn't straight forward, but it can be done).

I'm not sure how to use the entire 26^6 space if you use a LFSR. A LFRS is of a certain bit length and cycles through every possible combination of those bits (except 00...0 I think). You could use a 28 bit LFSR, but you would lose the top 40 million combinations (which is about 13% of them).

It appears that it is possible to map the states of the LFSR with ordinals (i.e., the nth state of the LFSR is x), but there's a patent on it... But you want to go in reverse anyway.

David Johnstone
There may be a patent on a specific method to find the n-th state of an LFSR, but it is long known how to compute the n-th state efficiently. E.g., books like "Finite fields and their applications" by Lidl and Niederreiter discuss this topic.
Accipitridae
+1  A: 

How about an LFSR? The linear feedback shift register is used to generate pseudo-random numbers in a range - the operation is deterministic given the seed value, but it can visit every value in a range with a long cycle.

1800 INFORMATION
what does it take as input? The preceding number that it generated? I would like something that takes an incrementing number as input, so that way if I can use that number as a clustered primary key and always write to the end of the table when I add rows.
You could feed your incrementing number if you use a maximum length LFSR (i.e. one with no repeats). In that case I'd do a few (say 10) rounds so you hit a few taps. It's not crypto strong, but it seems that's not your requirement.
peterchen
+5  A: 

Using a Hash function with a seed should make it unpredictable.
Security is obviously not an issue (else you would use cryptography).

Actually, you could straight-away use MD5 and select a fixed 6 characters for a simple solution that will work well. It is available in most languages and generates an alphanumeric hash a 128-bit hash that is easily written as 32 hexadecimals. That's actually just 16 characters (reduces to base 16).

Cooking up your own algorithm for unpredictable hashing is not advised.
Here is a Coding Horror blog entry you should read too.


I am blatantly double quoting from Jeff's Coding Horror reference to emphasize.

Suppose you're using something like MD5 (the GOD of HASH). MD5 takes any length string of input bytes and outputs 128 bits. The bits are consistently random, based on the input string. If you send the same string in twice, you'll get the exact same random 16 bytes coming out. But if you make even a tiny change to the input string -- even a single bit change -- you'll get a completely different output hash.

So when do you need to worry about collisions? The working rule-of-thumb here comes from the birthday paradox. Basically you can expect to see the first collision after hashing 2n/2 items, or 2^64 for MD5.

2^64 is a big number. If there are 100 billion urls on the web, and we MD5'd them all, would we see a collision? Well no, since 100,000,000,000 is way less than 2^64:

2^64 18,446,744,073,709,551,616
2^37 100,000,000,000


Update based on comments.

  • With a 6 character hexadecimal representation like I suggest above, the probability of collisions reduces to 2^12 -- which is just 4096! (read the whole Coding Horror article for the nuances).
  • If you do not want repeatability in your shortening (same shortened form for a URL every time), a random number should be fine.
nik
Hash functions are deterministic, and in this particular use case have many disadvantages and no advantages over a simple random number generator.
Zarel
Hash functions used with a seed are NOT predictable.
nik
...yes, but you still haven't listed a single benefit they have over a random number generator for this use case. RNGs are faster, too.
Zarel
Also, MD5 generates a 128-bit binary hash (as your quote states), not an alphanumeric hash. You may have seen its output in hexadecimal form and mistaken it for alphanumeric, but it is not.
Zarel
Yes Zarel, I think i need to give this some more thought. For some reason, I was biased towards having the shortening algorithm repeatable (same short form for the URL again and again). That is probably not required.
nik
The alphanumeric hash can be easily cut to required size -- was my approach.
nik
...my point is that MD5 is not an alphanumeric hash.
Zarel
Cutting to size, you will raise the collision rate to possibly unacceptable levels. The algorithm requested had no loop checking if the value was already generated.
Aaron
A: 

You want to permute your initial autoincrementing ID number with a Feistel network. This message (which happens to be on the PostgreSQL lists but doesn't really have much to do with PostgreSQL) describes a simple Feistel network. There are, of course, plenty of variations, but in general this is the Right Approach.

kquinn
Does it need to work in a range that has 4 or 8 or something like that as a factor? I seem to recall having problems with something like that
I'm not an expert on Feistel and SP networks (or I'd have posted a more thoughtful answer!), but I think all you need to run one is a set of bits. If you design a network to operate on chunks of 28 bits at a time, it will operate as a permutation on the set of 28-bit binary numbers, and all outputs will be < 2**28. Then you can just print them out in base 26 and be done. (I chose 28 bits since x = 28 is the largest x satisfying 2**x <= 26**6; if you allow 7-character IDs, you can go all the way to 32 bits, or 31 for 6 base-36 [alphanumeric] characters.)
kquinn
A: 

26^6 is around 300 million.

Easiest just to use a random number generator, and if you have a collision (i.e. in case your randomly generated 6-letter identifier is already taken), increment until you have a free identifier.

I mean, sure, you'll get collisions fairly early (at around 17 thousand entries), but incrementing until you have a free identifier will be plenty fast, at least until your keyspace starts to be saturated (around 12 million entries), and by then, you should be switching to 7-letter identifiers anyway.

Zarel
again, the original request was for a scheme that didn't require checking for collisions.
Aaron
A: 

You need a Block Cipher with "Block Space" of 266.

Choose an arbitrary key for the cipher, and you now have a transformation that is reversible by you, yet unpredictable for everyone else.

Your block size is a bit unusual, so you probably won't find a ready-made good block cipher for your size. But as suggested by kquinn you can design one on your own that mimics other ciphers.

yairchu
+3  A: 

Why not just shuffle the bits around in a specific order before converting to the base 26 value? For example, bit 0 becomes bit 5, bit 1 becomes bit 2, etc. To decode, just do the reverse.

Here's an example in Python. (Edited now to include converting the base too.)

import random

# generate a random bit order
# you'll need to save this mapping permanently, perhaps just hardcode it
# map how ever many bits you need to represent your integer space
mapping = range(28)
mapping.reverse()
#random.shuffle(mapping)

# alphabet for changing from base 10
chars = 'abcdefghijklmnopqrstuvwxyz'

# shuffle the bits
def encode(n):
    result = 0
    for i, b in enumerate(mapping):
        b1 = 1 << i
        b2 = 1 << mapping[i]
        if n & b1:
            result |= b2
    return result

# unshuffle the bits
def decode(n):
    result = 0
    for i, b in enumerate(mapping):
        b1 = 1 << i
        b2 = 1 << mapping[i]
        if n & b2:
            result |= b1
    return result

# change the base
def enbase(x):
    n = len(chars)
    if x < n:
        return chars[x]
    return enbase(x/n) + chars[x%n]

# go back to base 10
def debase(x):
    n = len(chars)
    result = 0
    for i, c in enumerate(reversed(x)):
        result += chars.index(c) * (n**i)
    return result

# test it out
for a in range(200):
    b = encode(a)
    c = enbase(b)
    d = debase(c)
    e = decode(d)
    while len(c) < 7:
        c = ' ' + c
    print '%6d %6d %s %6d %6d' % (a, b, c, d, e)

The output of this script, showing the encoding and decoding process:

   0            0       a            0    0
   1    134217728  lhskyi    134217728    1
   2     67108864  fqwfme     67108864    2
   3    201326592  qyoqkm    201326592    3
   4     33554432  cvlctc     33554432    4
   5    167772160  oddnrk    167772160    5
   6    100663296  imhifg    100663296    6
   7    234881024  ttztdo    234881024    7
   8     16777216  bksojo     16777216    8
   9    150994944  mskzhw    150994944    9
  10     83886080  hbotvs     83886080   10
  11    218103808  sjheua    218103808   11
  12     50331648  egdrcq     50331648   12
  13    184549376  pnwcay    184549376   13
  14    117440512  jwzwou    117440512   14
  15    251658240  veshnc    251658240   15
  16      8388608   sjheu      8388608   16
  17    142606336  mabsdc    142606336   17
  18     75497472  gjfmqy     75497472   18
  19    209715200  rqxxpg    209715200   19

Note that zero maps to zero, but you can just skip that number.

This is simple, efficient and should be good enough for your purposes. If you really needed something secure I obviously would not recommend this. It's basically a naive block cipher. There won't be any collisions.

Probably best to make sure that bit N doesn't ever map to bit N (no change) and probably best if some low bits in the input get mapped to higher bits in the output, in general. In other words, you may want to generate the mapping by hand. In fact, a decent mapping would be simply reversing the bit order. (That's what I did for the sample output above.)

FogleBird
does this require that the range has to be a power of two?
To an extent. It operates on N bits. You want 26**6 values (6 lowercase letters) which falls between 2**28 and 2**29. You can either use 28 bits and be done with it or use 29 bits and skip the ones that result in a 7-letter code. 28 bits will get you 268 million values instead of the 308 possible values. Also note that you can pad the codes that are *shorter* than 6 characters with leading 'a' characters, as that's the same as padding with '0' in base 10. Anyway, keep it simple dude... I gave you a complete working implementation.
FogleBird
In other words, if you're worried about losing a batch of codes between that 268 million and 308 million mark, you have bigger things to worry about, as 6-letter codes may not be enough in the long run. I can't imagine 308 being acceptable while 268 is not. Please clarify further if it is.
FogleBird
Have you given up on this question?
FogleBird
A: 

I asked basically the same question recently, and the solution was to increment by a prime number (modulo max) to get a nice seemingly random sequencing w/o repeating any numbers: http://stackoverflow.com/questions/1257825/tinyurl-style-unique-code-potential-algorithm-to-prevent-collisions

Dan Breen