I'm trying to generate unique IDs for use in a Google App Engine application and would like feedback on the feasibility of the approach I'm thinking of using (questions at the end). I've read quite a few questions on this topic, but I don't remember coming across this particular approach.

I'd like random-looking IDs, e.g., MD5 hashes, but I also want them to be small. Four to six characters, along the lines of tinyurl, would be ideal. The IDs will be for user-generated content; in my application, that means things like test questions that people will be writing. It's not necessary that the IDs be random-looking (it's fine if they look like serial IDs), but the approach I'm thinking of using lends itself to this, so it's not really an issue.

People familiar with Google App Engine will know that writes to the data store are particularly expensive and can result in timeouts if there are too many of them to the same entity group. Sharded counters are an approach that is often used to avoid write contention on a single global counter and the failed transactions that go with it.
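
For reference, the usual sharded-counter pattern looks roughly like this (a minimal sketch with the old db API; the shard count, model name and properties are placeholders of mine, not part of my application):

from google.appengine.ext import db
import random

NUM_SHARDS = 20  # placeholder shard count

class CounterShard(db.Model):
    name = db.StringProperty(required=True)   # which logical counter this shard belongs to
    count = db.IntegerProperty(default=0)

def increment(counter_name):
    """Pick one shard at random and increment it inside a transaction."""
    index = random.randrange(NUM_SHARDS)
    key_name = '%s-%d' % (counter_name, index)
    def txn():
        shard = CounterShard.get_by_key_name(key_name)
        if shard is None:
            shard = CounterShard(key_name=key_name, name=counter_name)
        shard.count += 1
        shard.put()
    db.run_in_transaction(txn)

def get_count(counter_name):
    """Sum over all shards; only writes contend, and only within a single shard."""
    return sum(shard.count for shard in CounterShard.all().filter('name =', counter_name))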

Along with getting short IDs and avoiding write contention, I'm trying to avoid the birthday paradox. I would like to prepare for the possibility of there being millions of IDs, even if this is going overboard a little.

I was thinking of using a sharded counter along the following lines (a rough sketch in code follows the list):

  • The counter is sharded on users, so that there is a shard for each user. Each counter object has its own count that is specific to a given user, which is incremented when a new item is created by that user. The count is incremented regardless of whether an item is successfully created.
  • The basis of an ID is an MD5 hash of the following string: "<user-email-address>|<latest-counter-value>".
  • The resulting MD5 hash is then truncated, initially to four characters.
  • A single global "length" value is maintained. Whenever the previous steps result in a duplicate key (one imagines this will happen fairly quickly at first), the value of length will be increased by one. MD5 hashes for new IDs will now be truncated at "length" characters, rather than four characters.
  • I don't want to expose the user email address, which suggests that a hash of some kind would be a good way to go.
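
A minimal sketch of the scheme described in this list (the helper below is hypothetical; the global length value is assumed to be passed in):

import hashlib

def candidate_id(user_email, user_counter_value, length):
    """Hash "<user-email-address>|<latest-counter-value>" and truncate to `length` characters.

    `length` would come from the single global "length" value, which is bumped
    by one whenever a truncated hash turns out to be a duplicate key.
    """
    seed = '%s|%d' % (user_email, user_counter_value)
    return hashlib.md5(seed).hexdigest()[:length]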

My questions are: Am I right in thinking that this will largely avoid write contention as a result of duplicate keys and that write contention on the length field would probably not be an issue, especially at longer lengths? Can anyone describe the mathematics involved here? Would the length quickly increase to near the length of an MD5 hash, calling into question the value of the whole approach? Would it be better just to go with the full (longer) MD5 hash in order to keep things easier to maintain? Is there anything that I'm overlooking?

+1  A: 

How about something like this:

  1. If you want 4-character keys using a-zA-Z0-9, then you'd have 62^4 = 14,776,336 (more than 14 million) possible values.

  2. Break this up into N partitions: 0000 ... 1000, 1001 ... 2000, ..., ZZAA ... ZZZZ

    Each partition is represented by an entity with: start id, end id, current id.

Now, to generate an ID:

  1. Randomly pick a partition. Use the partition's start id as its key. Start a transaction:
  2. get_or_create() that partition entity.
  3. If current id == end id, go to step 1.
  4. Save current id.
  5. Increment current id by one. End the transaction.

use the current id that was saved as your id.

If you pick N = 140, each partition would have roughly 100,000 values. This would allow quite a few concurrent inserts while limiting the number of retries due to picking an "empty" (exhausted) partition.
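
A rough sketch of this scheme with the old db API (model and function names are mine, not Dave's; the claimed integer would still need base-62 encoding to become a 4-character key):

from google.appengine.ext import db
import random

NUM_PARTITIONS = 140
PARTITION_SIZE = (62 ** 4) // NUM_PARTITIONS   # roughly 105,000 values per partition

class IdPartition(db.Model):
    start_id = db.IntegerProperty(required=True)
    end_id = db.IntegerProperty(required=True)
    current_id = db.IntegerProperty(required=True)

def _claim(key_name, start_id, end_id):
    """Transactionally claim the next id from one partition, or return None if it is full."""
    partition = IdPartition.get_by_key_name(key_name)
    if partition is None:
        partition = IdPartition(key_name=key_name, start_id=start_id,
                                end_id=end_id, current_id=start_id)
    if partition.current_id >= partition.end_id:
        return None
    claimed = partition.current_id
    partition.current_id += 1
    partition.put()
    return claimed

def next_id():
    """Pick a random partition and claim an id, retrying if that partition is exhausted."""
    while True:
        index = random.randrange(NUM_PARTITIONS)
        start = index * PARTITION_SIZE
        claimed = db.run_in_transaction(_claim, 'partition-%d' % index,
                                        start, start + PARTITION_SIZE)
        if claimed is not None:
            return claimed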

You might need to think about this more. In particular, how will you handle the case when you need to move to 5- or 6-digit keys?

-Dave

Dave
Thanks for the interesting approach. I'll give it some thought and try to better understand it. One question I have is how much it would result in collisions (or retries) as the number of keys grows large. I'm trying to keep collisions as close to zero as possible.
Eric W.
You will only run into collisions when the partitions fill up.
Dave
There are other optimizations you can do with this: 1. memcache a list of "filled partitions"; 2. if you are going to get a bunch of ids at once, you can grab a block of n ids from a partition and then increment its counter by that value.
Dave
What happens when two server instances request a new id from the same partition at the same time? Would they get the same id?
Eric W.
No, because you'll access the partition in a transaction. So if two try at the same time one of them will fail and retry.
Dave
As I read the question I was about to suggest this approach, but you beat me to it. I recommend this solution. Among other things, it should result in shorter IDs than the original poster's suggestion.
mcherm
+2  A: 

Just to add some hard numbers to the question above, I've implemented a small program to generate ids along the lines mentioned in the question, and these were the results of one of the trial runs:

Length     Count      MD5             Base 62
4          400        3d0e            446
5          925        4fc04           1Myi
6          2434       0e9368          40V6
7          29155      e406d96         GBFiA
8          130615     7ba787c8        2GOiCm
9          403040     75525d163       YNKjL9
10         1302992    e1b3581f52      H47JAIs
Total:     1869561

Each time the length of the hash increased, this was due to a collision, so there were six collisions in this instance. The count is the number of keys of a given length that were generated before a collision. The MD5 column shows the last truncated key that was successfully added before there was a duplicate key error. The column on the far right shows the key in base 62 (if I've done the conversion correctly).

It looks like more and more keys can be generated before a collision as each additional character is added, which is what you would expect. I'm hopeful that this approach will scale for user-generated content.
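
For what it's worth, these counts are in the same ballpark as the birthday-paradox estimate: with N equally likely values, the first collision is expected after roughly sqrt(pi * N / 2) draws. A quick check (my own calculation, not part of the trial program):

import math

for length in range(4, 11):
    N = 16 ** length   # a truncated MD5 hash has 16 possible hex values per character
    print length, int(math.sqrt(math.pi * N / 2))

This gives roughly 320, 1283 and 5133 for lengths 4, 5 and 6, which is the same order of magnitude as the observed counts, even though any single run can deviate by a factor of a few.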

For anyone who's interested, here's how I got the base 62 representation for the last column:

def base_62_encode(input):
    "Inspired by code at http://en.wikipedia.org/wiki/Base_36."
    CLIST = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    if input == 0:
        return CLIST[0]
    rv = ""
    while input != 0:
        rv = CLIST[input % 62] + rv   # prepend the next least-significant digit
        input //= 62                  # integer division
    return rv

base62_id = base_62_encode(long(truncated_hash, 16))


(Added later on:)

Here is a class that takes care of several things related to the generation of these IDs in the context of Google App Engine. By default, keys that are generated by this class are not purely base 62, since the first character of a GAE key name must be alphabetic. That requirement has been addressed here by using base 52 for the first character.

The primary base can be changed to something other than 62 by altering the "clist" value passed to the constructor (or by relying on the default). One might want to remove characters that are easy to mix up, for example "1", "l", "i", etc.

Usage:

keygen = LengthBackoffIdGenerator(SomeClass, initial_length=5)
small_id, modified = keygen.small_id(seed_value_from_sharded_counter)

Here is the class:

class LengthBackoffIdGenerator(object):
    """Class to generate ids along the lines of tinyurl.

    By default, ids begin with an alphabetic character.  Each id that is created is checked against
    previously issued ids to ensure uniqueness.  When a duplicate is generated, the length of the
    truncated hash is increased by one character.

    Ids become gradually longer over time while remaining relatively short, and there are very few retries in 
    relation to the number of keys generated.
    """

    def __init__(self, cls, clist='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ',
            alpha_first=True, initial_length=5, check_db=False):
        self.ids = set()  # key names issued so far by this instance
        self.clist = clist
        self.alpha_first = alpha_first
        if self.alpha_first:
            import re
            # Only non-numeric characters are allowed in the first position.
            alpha_list = re.sub(r'\d', '', clist)
            if len(alpha_list) < 1:
                raise ValueError('At least one non-numeric character is required.')
            self.alpha_list = alpha_list
        self.cls = cls
        self.length = initial_length
        self.check_db = check_db

    def divide_long_and_convert(self, radix, clist, input, rv):
        "Inspired by code at http://en.wikipedia.org/wiki/Base_36."
        rv += clist[input % radix]   # append the next digit
        input //= radix              # integer division
        return (input, rv)

    def convert_long(self, input):
        rv = ""
        if self.alpha_first:
            input, rv = self.divide_long_and_convert(len(self.alpha_list), self.alpha_list, input, rv)
        while input != 0:
            input, rv = self.divide_long_and_convert(len(self.clist),      self.clist,      input, rv)
        return rv

    def hash_truncate_and_binify(self, input, length):
        """Compute the MD5 hash of an input string, truncate it and convert it to a long.

        input  - A seed value.
        length - The number of hex characters of the MD5 hash to keep.
        """
        import hashlib
        truncated = hashlib.md5(input).hexdigest()[0:length]
        return long(truncated, 16)

    def _small_id(self, input):
        """Create an ID that looks similar to a tinyurl ID.

        The first letter of the ID will be non-numeric by default.
        """
        return self.convert_long(self.hash_truncate_and_binify(input, self.length))

    def small_id(self, seed):
        key_name = self._small_id(seed)
        increased_length = False
        # Optionally treat keys that already exist in the datastore as duplicates too.
        if self.check_db and self.cls.get_by_key_name(key_name) is not None:
            self.ids.add(key_name)
        if key_name in self.ids:
            # Duplicate: back off by making this and all subsequent ids one character longer.
            self.length += 1
            increased_length = True
            key_name = self._small_id(seed)
        self.ids.add(key_name)
        return (key_name, increased_length)
Eric W.
+1  A: 

This algorithm is clever and would indeed minimize write operations. The length value will converge to a balance between the shortest possible length and the absence of collisions.

You can relax the no-collision constraint by borrowing strategies from hash tables: try a few other candidate IDs before falling back to incrementing the length.

I would thus suggest appending a retry counter, initialized to 0, to the string being hashed. If the generated ID is already in use, increment the counter and retry; only once a maximum counter value is reached do you increase the length.

You will end up with a more efficient use of your ID address space and much less frequent length increments.
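
A minimal sketch of that strategy, keeping the question's MD5 truncation (`exists` is a hypothetical callback that checks whether a key is already taken, and MAX_RETRIES is an assumed budget):

import hashlib

MAX_RETRIES = 4  # assumed number of candidates to try before growing the length

def make_key(seed, length, exists):
    """Append a retry counter to the seed before hashing; only increase the
    length after MAX_RETRIES consecutive duplicates."""
    while True:
        for attempt in range(MAX_RETRIES):
            candidate = hashlib.md5('%s|%d' % (seed, attempt)).hexdigest()[:length]
            if not exists(candidate):
                return candidate, length   # also return the (possibly grown) length
        length += 1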

Regarding your question about the length limit of MD5: I think MD5 is overkill here. You don't really need a cryptographically secure hash; what you need is a random bit generator, and for that you can use CRC-32 (or Adler-32, which is faster), especially if the code is to run on a mobile phone.

To implement a random bit generator with CRC-32, prepend a 32-bit value to the string to be hashed and initialize it with a constant of your choice (the seed). Then compute the CRC-32 of this string. If you need more bytes, write the obtained CRC-32 value into the 32-bit prefix and recompute the CRC-32; iterate until you have enough bits. You may replace CRC-32 with the algorithm of your choice. This yields a cheap and fast random bit generator where the initial constant is the seed.
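
For illustration, a sketch of that generator in Python using zlib's CRC-32 (the function name and byte packing are my own choices):

import struct
import zlib

def crc32_bit_stream(data, seed, num_bytes):
    """Produce num_bytes of pseudo-random bytes from `data`.

    A 32-bit state (initially the seed) is prepended to the input, the whole
    thing is run through CRC-32, and the result becomes the new state; repeat
    until enough bytes have been collected.  Not cryptographically secure."""
    state = seed & 0xffffffff
    out = b''
    while len(out) < num_bytes:
        state = zlib.crc32(struct.pack('>I', state) + data) & 0xffffffff
        out += struct.pack('>I', state)
    return out[:num_bytes]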

With this algorithm you minimize the number of random bits generated and also have no upper limit on length.

Regarding encoding, you didn't specify the constraints. Can you use upper and lower case letters with the digits? Your example uses an alphabet of 36 different ASCII values. Once you have the pseudo-random number generator described above, which can generate as many bytes as desired, simply define length as the number of letters of your ID and pick each letter with a modulo of the next random byte. You'll then know exactly how many bytes to generate in one go, and generating the ID is trivial.
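
Building on the generator sketched above, the letter-picking step might look like this (the alphabet is an assumption; the modulo mapping has a slight bias unless 256 is a multiple of the alphabet size, which is harmless for non-cryptographic ids):

ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

def make_id(seed_string, seed, length, alphabet=ALPHABET):
    """One random byte per output letter, mapped into the alphabet by modulo."""
    random_bytes = bytearray(crc32_bit_stream(seed_string, seed, length))
    return ''.join(alphabet[b % len(alphabet)] for b in random_bytes)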

chmike