views:

4471

answers:

10

I want to create a URL shortener service where you can write a long URL into an input field and the service shortens the URL to "http://www.example.org/abcdef". Instead of "abcdef" there can be any other string with six characters containing a-z, A-Z and 0-9. That makes 56 trillion possible strings.

My approach:

I have a database table with three columns:

  1. id, integer, auto-increment
  2. long, string, the long URL the user entered
  3. short, string, the shortened URL (or just the six characters)

I would then insert the long URL into the table. Then I would select the auto-increment value for "id" and build a hash of it. This hash should then be inserted as "short". But what sort of hash should I build? Hash algorithms like MD5 create too long strings. I don't use these algorithms, I think. A self-built algorithm will work, too.

My idea:

For "http://www.google.de/" I get the auto-increment id 239472. Then I do the following steps:

short = '';
if divisible by 2, add "a"+the result to short
if divisible by 3, add "b"+the result to short
... until I have divisors for a-z and A-Z.

That could be repeated until the number isn't divisible any more. Do you think this is a good approach? Do you have a better idea?

+4  A: 

Fellow StackOverflow user Michael Stum has made a podcast video of how to create an URL shortening service in his blog. Check it out.

Konrad Rudolph
Thanks! But he just uses the auto-increment value from the database table. I want to replace the id (number) by a-z, A-Z and 0-9.
Yes, that was supposed to be part of a later series. I'll completely rewrite the project soon as a Proof-Of-Concept application. The Base36 function to convert an ID into a string would be here: http://www.stum.de/2008/10/20/base36-encoderdecoder-in-c/
Michael Stum
+1  A: 

Why do you have to save a hash at all, when you could just convert the id to base64/hex/whatever/don't convert it at all?

tstenner
+11  A: 

Why would you want to use a hash?
You can just use a simple translation of your auto-increment value to an alphanumeric value. You can do that easily by using some base conversion. Say you character space (A-Z,a-z,0-9 etc') has 40 characters, convert the id to a base-40 number and use the characters are the digits.

shoosh
asides from the fact that A-Z, a-z and 0-9 = 62 chars, not 40, you are right on the mark.
Evan Teran
Thanks! Should I use the base-62 alphabet then?http://en.wikipedia.org/wiki/Base_62But how can I convert the ids to a base-62 number?
Using a base conversion algorithm ofcourse - http://en.wikipedia.org/wiki/Base_conversion#Change_of_radix
shoosh
Thank you! That's really simple. :) Do I have to do this until the dividend is 0? Will the dividend always be 0 at some point?
yes, if you keep dividing an integer n by k, it will reach 0 after a maximum of log_k(n)+1 divisions (try to prove this)
shoosh
Regarding "Why would you want to use a hash?", a base conversion based on auto-increment is going to create sequential URLs, so you'd need to be comfortable with people being able to "browse" other peoples' shortened URLs, right?
Andrew Coleson
Yes, but that's no problem, is it? I don't want to encrypt the URLs but I want to shorten them.
with enough resources and time you can "browse" all the URLs of of any URL shortening service.
shoosh
+1  A: 

Why not just translate your id to a string? You just need a function that maps a digit between, say, 0 and 61 to a single letter (upper/lower case) or digit. Then apply this to create, say, 4-letter codes, and you've got 14.7 million URLs covered.

cr333
+27  A: 
Marcel J.
Thank you very much for this detailed description. shoosh mentioned the base conversion, too. I think this is a very good way.My first urls will have 2 characters, won't they? Then they will successively become longer!? Using this method, I won't need more than 6 characters for 56 trillion URLs?
Your first 62 urls will have only one character. You could however (if you wanted to) just always add (and on resolve: substract) a constant value to make them longer. And yes, 6 chars will indeed cover 56 billion (not trillion) URLs. But I wonder how big tinyURLs table is. ;)
Marcel J.
TinyURL has 200 million URLs which they need only two columns in the database for. One column (mysql) could be VARCHAR(6) with each 7 bytes and the other one could be bigint with each 8 bytes. That makes 15 bytes per entry. So you have 2.8GB without the index. ;)
Don't forget to sanitize the URLs for malicious javascript code! Remember that javascript can be base64 encoded in a URL so just searching for 'javascript' isn't good enough.j
apphacker
@apphacker: Could you shortly explain how to sanitize please? I thought it would be enough to use strip_tags() function in PHP.Or tell me if that cannot be explained shortly, then I'll post it as a new question here.
Could someone explain how to implement this in PHP? (just the conversion)
abrahamvegh
A function must be bijective (injective *and* surjective) to have an inverse.
Gumbo
@Gumbo: You are right. An injective function only implies a left inverse, but no inverse. Fixed that.
Marcel J.
Food for thought, it might be useful to add a two character checksum to the url. That would prevent direct iteration of all the urls in your system. Something simple like f(checksum(id) % (62^2)) + f(id) = url_id
koblas
+2  A: 

If you don't want re-invent the wheel ... http://lilurl.sourceforge.net/

Alister Bulman
"Sorry, it looks like spammers got to this. Try tinyurl instead."
takeshin
to the demo site. The source code is still downloadable from Sourceforge.
Alister Bulman
+5  A: 

Not an answer to your question, but I wouldn't use case-sensitive shortened URLs. They are hard to remember, usually unreadable (many fonts render 1 and l, 0 and O and other characters very very similar that they are near impossible to tell the difference) and downright error prone. Try to use lower or upper case only.

Also, try to have a format where you mix the numbers and characters in a predefined form. There are studies that show that people tend to remember one form better than others (think phone numbers, where the numbers are grouped in a specific form). Try something like num-char-char-num-char-char. I know this will lower the combinations, especially if you don't have upper and lower case, but it would be more usable and therefore useful.

Ash
Thank you, very good idea. I haven't thought about that yet. It's clear that it depends on the kind of use whether that makes sense or not.
+2  A: 

My approach: Take the Database ID, then Base36 Encode it. I would NOT use both Upper AND Lowercase letters, because that makes transmitting those URLs over the telephone a nightmare, but you could of course easily extend the function to be a base 62 en/decoder.

Michael Stum
Thanks, you're right. Whether you have 2,176,782,336 possibilities or 56,800,235,584, it's the same: Both will be enough. So I will use base 36 encoding.
It may be obvious but here is some PHP code referenced in wikipedia to do base64 encode in php http://www.tonymarston.net/php-mysql/converter.html
Ryan
+1  A: 
alphabet = map(chr, range(97,123)+range(65,91)) + map(str,range(0,10))

def lookup(k, a=alphabet):
    if type(k) == int:
        return a[k]
    elif type(k) == str:
        return a.index(k)


def encode(i, a=alphabet):
    '''Takes an integer and returns it in the given base with mappings for upper/lower case letters and numbers 0-9.'''
    try:
        i = int(i)
    except Exception:
        raise TypeError("Input must be an integer.")

    def incode(i=i, p=1, a=a):
        # Here to protect p.                                                                                                                                                                                                                
        if i <= 61:
            return lookup(i)

        else:
            pval = pow(62,p)
            nval = i/pval
            remainder = i % pval
            if nval <= 61:
                return lookup(nval) + incode(i % pval)
            else:
                return incode(i, p+1)

    return incode()



def decode(s, a=alphabet):
    '''Takes a base 62 string in our alphabet and returns it in base10.'''
    try:
        s = str(s)
    except Exception:
        raise TypeError("Input must be a string.")

    return sum([lookup(i) * pow(62,p) for p,i in enumerate(list(reversed(s)))])a

Here's my version for whomever needs it.

MrChrisRodriguez
+1  A: 

You could use the encoding alg I used for my Google App Engine URL Shortener here - http://i.loo.lu/W

designcurve