I'm trying to generate unique IDs for use in a Google App Engine application and would like feedback on the feasibility of the approach I'm thinking of using (questions at the end). I've read quite a few questions on this topic, but I don't remember coming across this particular approach.
I'd like random-looking IDs, e.g., MD5 hashes, but I also want them to be small. Four to six characters, along the lines of tinyurl, would be ideal. The IDs will be for user-generated content, in the context of my application, things like test questions that people will be writing. It's not necessary that the IDs be random-looking (it's fine if they look like serial IDs), but the approach I'm thinking of using lends itself to this, so it's not really an issue.
People familiar with Google App Engine will know that writes to the data store are particularly expensive and can result in timeouts if there are too many of them to the same entity group. Sharded counters are an approach that is often used to avoid write contention on a single global counter and the failed transactions that go with it.
Along with getting short IDs and avoiding write contention, I'm trying to avoid the birthday paradox. I would like to prepare for the possibility of there being millions of IDs, even if this is going overboard a little.
I was thinking of using a sharded counter along the following lines:
- The counter is sharded on users, so that there is a shard for each user. Each counter object has its own count that is specific to a given user, which is incremented when a new item is created by that user. The count is incremented regardless of whether an item is successfully created.
- The basis of an ID is an MD5 hash of the following string: "<user-email-address>|<latest-counter-value>".
- The resulting MD5 hash is then truncated, initially to four characters.
- A single global "length" value is maintained. Whenever the previous steps result in a duplicate key (one imagines this will happen fairly quickly at first), the value of length will be increased by one. MD5 hashes for new IDs will now be truncated at "length" characters, rather than four characters.
- I don't want to expose the user email address, which suggests that a hash of some kind would be a good way to go.
My questions are: Am I right in thinking that this will largely avoid write contention as a result of duplicate keys and that write contention on the length field would probably not be an issue, especially at longer lengths? Can anyone describe the mathematics involved here? Would the length quickly increase to near the length of an MD5 hash, calling into question the value of the whole approach? Would it be better just to go with the full (longer) MD5 hash in order to keep things easier to maintain? Is there anything that I'm overlooking?