I am working in Python on App Engine.

I am trying to create the equivalent of the "v" value in YouTube URLs (http://www.youtube.com/watch?v=XhMN0wlITLk) for retrieving specific entities. The datastore auto-generates a key, but it is far too long (34 characters). I have experimented with hashlib to build my own, but again I get a long string. I would like to keep it to under 11 characters (I am not dealing with a huge number of entities); letters and numbers are both acceptable.

It seems like there should be a pretty standard solution. I am probably just missing it.

+6  A: 

You can use the auto-generated integer id of the key to generate the short id. A simple way to do this is to convert the integer id to base62 (alphanumeric). To fetch the object, simply convert the base62 string back to an integer and use get_by_id to retrieve the object.

Here are two simple base62 conversion functions that I have used in one of my apps.

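import string

# 62 alphanumeric symbols: a-z, A-Z, 0-9
alphabet = string.ascii_letters + string.digits
max_len = 11  # fixed width of the generated id

def int_to_base62(num):
    if num == 0:
        return alphabet[0] * max_len  # zero is padded like any other value

    arr = []
    radix = len(alphabet)
    while num:
        arr.append(alphabet[num % radix])
        num //= radix  # integer division, so num stays an int
    arr.reverse()
    # left-pad with the zero digit so every id has the same length
    return (alphabet[0] * (max_len - len(arr))) + ''.join(arr)

def base62_to_int(text):
    radix = len(alphabet)
    power = len(text) - 1
    num = 0
    for char in text:
        num += alphabet.index(char) * (radix ** power)
        power -= 1
    return num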
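
As a rough illustration of the round trip described above, here is a minimal self-contained sketch. The names encode, decode, lookup and fake_datastore are made up for this example, the id value is arbitrary, and the dict only stands in for the datastore. In a real handler the lookup would go through get_by_id on your own model class.

```python
import string

ALPHABET = string.ascii_letters + string.digits  # 62 symbols
RADIX = len(ALPHABET)
WIDTH = 11  # fixed id length, same as in the functions above


def encode(num):
    """Fixed-width base62 encoding of a non-negative integer."""
    digits = []
    while num:
        num, rem = divmod(num, RADIX)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits)).rjust(WIDTH, ALPHABET[0])


def decode(text):
    """Inverse of encode; leading pad characters contribute zero."""
    num = 0
    for char in text:
        num = num * RADIX + ALPHABET.index(char)
    return num


# Hypothetical stand-in for the datastore: integer id -> entity.
fake_datastore = {5629499534213120: "my entity"}


def lookup(short_id):
    # Assumption: a real app would call something like
    # MyModel.get_by_id(decode(short_id)) here instead.
    return fake_datastore.get(decode(short_id))


short_id = encode(5629499534213120)
print(short_id, lookup(short_id))
```

Since 62**11 is larger than 2**64, any 64-bit integer id fits in 11 characters with this scheme.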
z33m
Those two links were very helpful. The problem now is finding the ideal way to encode and decode in base62. I have done some reading; is there a method you suggest?
LeRoy
You can use basic number base conversion techniques. To make the id fixed-length, just add some zero padding to the base62 number.
z33m
+1  A: 

If you have a value that is unique for every entity, you can get a shorter version by hashing it and truncating. Hashes like md5 or sha1 are well-mixed, meaning that every bit in the output has a 50% chance of flipping if you change one bit in the input. Truncating the hash simply increases the odds of a collision, so you can trade length against collision odds.

URL-safe base64 encoding is a good option for turning the hash into text.

import base64
import hashlib
orig_id = 'weiowoeiwoeciw0eijw0eij029j20d232weifw0jiw0e20d2'  # the original id
shorter_id = base64.urlsafe_b64encode(hashlib.md5(orig_id.encode('utf-8')).digest())[:11].decode('ascii')
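
To make the shape of the output concrete, here is a small sketch of what the full encoding looks like before truncation. The input string is a made-up placeholder. An md5 digest is 16 bytes, so the full base64 text is 24 characters and ends in two "=" padding characters, which the truncation drops.

```python
import base64
import hashlib

orig_id = "some-long-unique-value"  # hypothetical input, not a real key

digest = hashlib.md5(orig_id.encode("utf-8")).digest()  # 16 raw bytes
full = base64.urlsafe_b64encode(digest).decode("ascii")
short = full[:11]

print(len(full), full[-2:])  # 24 characters, the last two are "=" padding
print(short)
```

Note that the URL-safe alphabet also contains "-" and "_", so the result is not strictly letters and numbers.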

With base64, you have 6 bits of information per character, so 11 characters give you 66 bits of uniqueness, or a 1 in 2**66 chance that any two given ids collide.
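
That figure is for a single pair of ids. For a rough idea of the odds across a whole collection, the standard birthday approximation can be sketched as below. The function name and the sample counts are only illustrative, and it assumes the truncated hashes behave like uniformly random 66-bit values.

```python
import math


def collision_probability(num_ids, bits=66):
    """Approximate chance that at least two of num_ids random values collide.

    Uses the birthday approximation 1 - exp(-n * (n - 1) / (2 * 2**bits)).
    """
    space = 2 ** bits
    exponent = -num_ids * (num_ids - 1) / (2.0 * space)
    # -expm1(x) equals 1 - exp(x) but stays accurate for tiny probabilities.
    return -math.expm1(exponent)


for count in (10**3, 10**6, 10**9):
    print(count, collision_probability(count))
```

Shortening the id lowers bits by 6 per character removed, so the same function can be used to see how quickly the odds grow.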

Ned Batchelder
Is there a reason you would choose base64 conversion over base62, as suggested above?
LeRoy
Base64 seems to always include a "=" which isn't really querystring safe.
LeRoy
I use base64 over base62 just because it's more familiar. The = is padding. You're truncating anyway, right?
Ned Batchelder