views: 474

answers: 3

I'm adding a feature to my project where we generate links to internal pages of our website, and we want these links to be as short as possible, so we'll be making our own "URL Shortener".

I'm wondering what the best encoding / alphabet is for the generated short URLs. This is largely a subjective question, so I'd like to know your opinions on the best approach / trade-off.

Several options I've thought of:
- Digits, uppercase + lowercase (base 62)
- Digits, only lowercase (base 36)
- Base 32 (http://www.crockford.com/wrmg/base32.html)
- linkpot.net (using common short English words)

Of course, the last two are better for uses other than clicking (reading aloud, copying by hand), while the first two give shorter URLs and so are better for Twitter.

Also, if I'm going with "clickable-only" URLs, I'd like to make the alphabet as large as possible, adding other symbols.

  • What symbols can I use in URLs that won't get URL encoded? (See the quick check after this list.)
  • What symbols should I use? Could some of these prove problematic? I'm thinking slash and dot, for example.
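For the first question, a quick way to check is to run every printable ASCII character through a standard percent-encoder and see which ones come back unchanged. A minimal sketch, assuming Python's urllib; the result should match RFC 3986's unreserved set (letters, digits, and - . _ ~):

    # Which printable ASCII characters survive percent-encoding untouched?
    import string
    from urllib.parse import quote

    unencoded = [c for c in string.printable if quote(c, safe="") == c]
    print("".join(unencoded))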

What do you think?

NOTE: The main target for these URLs is Twitter. Keeping this in mind, we should probably have the largest alphabet possible, since most people will be clicking. However, I'm interested in your experience with people using short URLs in other ways (over the phone, on printed paper, etc.). How likely is that to happen?

NOTE 2: I'm not making "yet another URL shortener", so please don't condemn me with downvotes. We are generating short URLs for internal stuff in our site, not allowing anyone to shorten any URL. Imagine Google Maps giving you short URLs when you generate a link to a specific coordinate.

+2  A: 

If these are "clickable-only" URLs, I'd probably go with a base-64 encoding. MIME's base-64 uses a couple of characters you shouldn't use in URLs, but there are enough unreserved safe characters that you can just swap them out. (Also, you don't need the padding that MIME's base-64 uses, since you know where your URL ends.)

Here's a page that discusses one way to do this.

You can look at RFC 2396 to figure out exactly which characters are safe in URIs if you want to double-check.
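A minimal sketch of that idea (assuming Python, integer IDs, and the hypothetical helper names shorten_id / unshorten_id): encode the ID's bytes with the URL-safe base-64 alphabet, which swaps '+' and '/' for '-' and '_', and strip the '=' padding since the end of the URL marks the end of the data.

    import base64

    def shorten_id(n: int) -> str:
        """Hypothetical helper: URL-safe base-64 encode an integer ID, without padding."""
        raw = n.to_bytes((n.bit_length() + 7) // 8 or 1, "big")
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

    def unshorten_id(s: str) -> int:
        """Reverse of shorten_id: restore the '=' padding, decode, rebuild the integer."""
        padded = s + "=" * (-len(s) % 4)
        return int.from_bytes(base64.urlsafe_b64decode(padded), "big")

Round-tripping unshorten_id(shorten_id(n)) == n holds for any non-negative ID, and every character produced is in the URL-safe set.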

Laurence Gonsalves
+2  A: 

I'd be curious to know a little more about the implementation. How will these URLs be "unshortened"? Or will the internal pages being accessed be saved under their shortened URLs? In either case, even if you went with an encoding set of just [A-Z], you'd be able to reference 26 * 26 * 26 = 17,576 pages with only 3 characters; how many internal web pages are you talking about?

In general I would let your use-case requirements drive the choice of encoding set. Are you planning on having these links available for "uses other than clicking"? What would those uses be, and how do you suspect they'll constrain the encoding? (For example, using part of the URL as a file name on a case-insensitive file system reduces the available character set.)

Here's an informative page on the character set you have available to you when writing a URL.
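To put the capacity argument in concrete terms, here's a throwaway calculation (a sketch in Python; the alphabets are the ones discussed in the question and answers):

    # How many distinct short codes each alphabet yields at a few lengths.
    for name, size in [("A-Z", 26), ("0-9a-z", 36), ("0-9a-zA-Z", 62)]:
        for length in (3, 5, 7):
            print(f"{name:>10} x {length}: {size ** length:,}")

With 26 characters you get the 17,576 three-character codes mentioned above; the larger alphabets grow much faster.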

fbrereto
Thank you for your answer. Internally I will have "entities" created by users, which will have a unique integer ID. I will then expose these as the shortened URL just to make it shorter for Twitter... So, you could have mydomain.com/1525343 or mydomain.com/a4D, which would mean the same to me, but it'll be shorter.
Daniel Magliola
If these are going to be used by external clients, I'd lean more towards a simpler encoding range, like [0-9a-z]. I wouldn't include [A-Z] so users can manually enter URLs without worrying about upper/lowercase. Even with a 36-character range like that, you accomplish a tremendous amount of shortening. For example, 5 characters alone nets you 60,466,176 unique shortened URLs.
fbrereto
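Putting the two comments above together, here is a minimal sketch (assuming Python; encode and decode are hypothetical names) that turns a unique integer ID into a short code using whichever alphabet you settle on, base 36 or base 62:

    BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"
    BASE62 = BASE36 + "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

    def encode(n: int, alphabet: str = BASE62) -> str:
        """Convert a non-negative integer ID into a short code."""
        if n == 0:
            return alphabet[0]
        base, digits = len(alphabet), []
        while n:
            n, rem = divmod(n, base)
            digits.append(alphabet[rem])
        return "".join(reversed(digits))

    def decode(s: str, alphabet: str = BASE62) -> int:
        """Map a short code back to the original integer ID."""
        base, n = len(alphabet), 0
        for ch in s:
            n = n * base + alphabet.index(ch)
        return n

With the 62-character alphabet an ID like 1525343 fits in 4 characters, and switching between case-sensitive and case-insensitive codes is just a matter of which alphabet string you pass in.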
+2  A: 

I would go with Base-62; it's the shortest. A shortened URL isn't meant to be entered manually anyway, so don't worry about case sensitivity.

ZZ Coder