views: 253
answers: 3

I have some data that comes regularly as a dump from a data source, with a string natural key that is long (up to 60 characters) and not relevant to the end user. I am using this key in a URL. This makes the URLs too long and user-unfriendly.

I would like to transform the string keys into integers with the following requirements:

The source dataset will change over time.

The ID should be:

  • a non-negative integer
  • unique and constant even if the set of input keys changes
  • preferably reversible back to the key (not a strong requirement)

The database is rebuilt from scratch every time, so I cannot remember the already assigned IDs, match the new data set against the existing IDs, and generate sequential IDs only for the added keys.

There are currently around 30000 distinct keys and the set is constantly growing.

How to implement a function that will map string keys to integer IDs?

What I have thought about:

1. Built-in string.GetHashCode (see the sketch below):

ID(key) = Math.Abs(key.GetHashCode())

  • is not guaranteed to be unique
  • (not reversible)

1.1 "Re-hashing" the built-in GetHashCode until a unique ID is generated to prevent collisions.

  • existing IDs may change if a colliding key is added earlier in the input data set

2. a perfect hashing function

  • I am not sure if this can generate constant IDs if the set of inputs changes
  • (not reversible)

3. translate to base 36/64/??

  • does not shorten the long keys enough

What are the other options?
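
For reference, a minimal sketch of option 1 (masking the sign bit instead of calling Math.Abs avoids the overflow on int.MinValue, but the uniqueness and stability caveats above still apply):

using System;

// Illustrative only: a non-negative 31-bit value from the built-in hash code.
// string.GetHashCode is not guaranteed to be collision-free, and on newer .NET
// runtimes it is randomized per process, so it is not a stable ID source.
static int Id(string key) => key.GetHashCode() & 0x7FFFFFFF;

Console.WriteLine(Id("2.0_SomeNameThatMayRepeat_SubTypeThatWillNotRepeat"));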

+1  A: 

A Base64-encoded SHA-1 digest is 27 characters (without padding); base64(md5(...)) is 22 characters. Any smaller and you will have a non-negligible risk of collisions.

Perfect hashing functions aren't possible when the set of inputs changes.
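
A quick way to check those lengths with the BCL (the example key is taken from the comments below; trimming the '=' padding gives the 27- and 22-character figures):

using System;
using System.Security.Cryptography;
using System.Text;

string key = "2.0_SomeNameThatMayRepeat_SubTypeThatWillNotRepeat";

byte[] sha1 = SHA1.Create().ComputeHash(Encoding.UTF8.GetBytes(key));
byte[] md5  = MD5.Create().ComputeHash(Encoding.UTF8.GetBytes(key));

Console.WriteLine(Convert.ToBase64String(sha1).TrimEnd('=')); // 27 characters
Console.WriteLine(Convert.ToBase64String(md5).TrimEnd('='));  // 22 characters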

Marcelo Cantos
Still way too long. Given that I have 30,000 distinct keys, it should be possible to have 30,000 IDs = 5 ASCII bytes in the URL.
Marek
@Marek: I'm not sure it is possible... unless you can tell us more about the properties of those keys, then we have to map the integers over the entire set of possible inputs to ensure no collisions. "1 to 60 character" keys is not enough info for us to be able to reduce it by much.
Mark
@Mark: there are a few small patterns in the strings but I do not think this is relevant to the question. An example key: 2.0_SomeNameThatMayRepeat_SubTypeThatWillNotRepeat (the length more or less matches the real data)
Marek
You could simply truncate a SHA-1 or MD5 hash until it's small enough, but you're basically shooting arrows up in the air and trusting that you're a small enough target. 30,000 five-character (base64) keys have a 34% probability of colliding.
Marcelo Cantos
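
(A quick check of that 34% figure, using the standard birthday approximation with 30,000 keys drawn over the 64^5 possible five-character base64 values:)

using System;

double n = 30_000;              // distinct keys
double N = Math.Pow(64, 5);     // five base64 characters ≈ 1.07e9 values
double p = 1 - Math.Exp(-n * (n - 1) / (2 * N));
Console.WriteLine(p);           // ≈ 0.34
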
A: 

Set up a second, persistent DB and store your KEY/ID pairs there. Make sure you also have the data's date in the table so you can do some housekeeping.
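
A minimal file-backed sketch of what such a store could look like (the file name and tab-separated format are just placeholders; a real DB such as SQLite would serve the same purpose). The date column is there for the housekeeping mentioned above:

using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical persistent KEY/ID store: one "key<TAB>id<TAB>dateSeen" line per key.
class KeyIdStore
{
    private readonly string _path;
    public Dictionary<string, int> IdByKey { get; } = new Dictionary<string, int>();

    public KeyIdStore(string path)
    {
        _path = path;
        if (File.Exists(path))
            foreach (var line in File.ReadAllLines(path))
            {
                var parts = line.Split('\t');
                IdByKey[parts[0]] = int.Parse(parts[1]);
            }
    }

    // Appends a newly assigned pair together with the date it was first seen.
    public void Add(string key, int id)
    {
        IdByKey[key] = id;
        File.AppendAllText(_path, $"{key}\t{id}\t{DateTime.UtcNow:yyyy-MM-dd}{Environment.NewLine}");
    }
}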

lexu
Thanks for a practical suggestion, but this is not possible due to performance limitations - the whole system is in-memory; there is no persistent database.
Marek
I see. But as Guffa says: *You can only do that if you can keep a list of assigned IDs.* Doing that is cheapest (in dev time) if you use a DB (e.g. SQLite). 30,000 rows/day may seem big to you, but handled with the proper tools, it isn't.
lexu
@lexu: there are 30,000 keys, and each key has multiple representations. Currently we have 600,000 rows (expected to double soon). And the processing does not occur once a day but once every 20 minutes, with some strict response-time restrictions. That's why we have skipped the database altogether. A database (relational or not) would probably be able to handle this, but with much more expensive hardware.
Marek
For in-memory solutions, AVL trees are good. They have O(log N) search and O(log N) insert performance. This is a kind of table that will only grow, so AVL will do better than red-black, since AVL is better balanced. Also, given that the minimum size of a data item is 4 bytes (a 32-bit integer), an in-memory table will never grow beyond about 1G items -> at most about 30 comparisons per lookup. If each iteration takes 100 ns (DRAM access + processing), you will handle about 300K lookups/second using one CPU.
Ernelli
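
(For illustration, a minimal in-memory version of that lookup; the BCL's SortedDictionary is a red-black tree rather than an AVL tree, but it has the same O(log n) lookup and insert, and a hand-rolled AVL tree would be a drop-in replacement. The keys below are made up.)

using System;
using System.Collections.Generic;

// Ordered in-memory map from natural key to assigned ID.
var idByKey = new SortedDictionary<string, int>(StringComparer.Ordinal);
idByKey["2.0_SomeName_SubTypeA"] = 17;
idByKey["2.0_SomeName_SubTypeB"] = 42;

if (idByKey.TryGetValue("2.0_SomeName_SubTypeA", out int id))
    Console.WriteLine(id); // O(log n) lookup
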
+1  A: 

You can only do that if you can keep a list of assigned IDs.

For any given algorithm that actually gives you a unique ID for the current set, a new value is not guaranteed to get a unique ID.

The strings contain about 400 bits of information, so to get an integer that is guaranteed to be unique it would have to contain all the information from the string and thus be about 400 bits. That's about 120 digits when expressed as a decimal number, so it is not shorter than what you have now.
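
The arithmetic behind those figures, assuming up to 60 printable-ASCII characters per key:

using System;

double bitsPerChar = Math.Log(95, 2);     // ~6.6 bits for each of 95 printable ASCII characters
double bits = 60 * bitsPerChar;           // ~394 bits of information in a 60-character key
double digits = bits / Math.Log(10, 2);   // ~119 decimal digits to write that out as an integer
Console.WriteLine($"{bits:F0} bits ≈ {digits:F0} decimal digits");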

Guffa
Thanks, that makes sense - and this is more or less one of my premises. But I would like some practical workaround that relies on the fact that: 1) the keys in the input set are all known, and 2) I can detect and resolve duplicates resulting from new values in the set (re-generate the ID using a second (n-th) algorithm).
Marek
+1 for that clear statement in your first line.
lexu
@Marek: As you want the ID for a key to be constant regardless of the current set, the fact that you know the current set is no help. Regenerating an ID when you find a duplicate would mean that the IDs are not constant.
Guffa
I ended up keeping a lightweight persistent list of old mappings to resolve collisions.
Marek
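
For completeness, one way such a scheme could be put together: a stable hash gives each key its ID, and a small persisted list of previously assigned mappings keeps existing IDs constant across rebuilds, with new colliding keys probed to a free value. This is only a sketch under those assumptions; the names and file format are hypothetical.

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Text;

static class StableIds
{
    // Stable non-negative 31-bit hash (unlike string.GetHashCode, SHA-1 does not
    // change between runtimes or processes).
    static int StableHash(string key)
    {
        using var sha1 = SHA1.Create();
        var bytes = sha1.ComputeHash(Encoding.UTF8.GetBytes(key));
        return BitConverter.ToInt32(bytes, 0) & 0x7FFFFFFF;
    }

    // mappingFile: persisted "key<TAB>id" lines from earlier runs.
    public static Dictionary<string, int> Assign(IEnumerable<string> keys, string mappingFile)
    {
        var assigned = new Dictionary<string, int>();
        if (File.Exists(mappingFile))
            foreach (var line in File.ReadAllLines(mappingFile))
            {
                var parts = line.Split('\t');
                assigned[parts[0]] = int.Parse(parts[1]);
            }

        var used = new HashSet<int>(assigned.Values);
        var added = new List<string>();

        foreach (var key in keys)
        {
            if (assigned.ContainsKey(key))
                continue;                   // previously seen key: its ID never changes

            int id = StableHash(key);
            while (!used.Add(id))           // collision with an already-used ID:
                id = (id + 1) & 0x7FFFFFFF; // probe linearly to the next free value

            assigned[key] = id;
            added.Add($"{key}\t{id}");
        }

        if (added.Count > 0)
            File.AppendAllLines(mappingFile, added);
        return assigned;                    // mappings for all keys seen so far
    }
}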