views: 253
answers: 3

I have some data that comes regularly as a dump from a data source, with a string natural key that is long (up to 60 characters) and not relevant to the end user. I am using this key in a URL. This makes the URLs too long and user-unfriendly.

I would like to transform the string keys into integers with the following requirements:

The source dataset will change over time.

The ID should be:

  • a non-negative integer
  • unique and constant even if the set of input keys changes
  • preferably reversible back to the key (not a strong requirement)

The database is rebuilt from scratch every time, so I cannot remember the already assigned IDs, match the new data set against the existing IDs, and generate sequential IDs only for the added keys.

There are currently around 30000 distinct keys and the set is constantly growing.

How to implement a function that will map string keys to integer IDs?

What I have thought about:

1. Built-in string.GetHashCode (see the sketch below):

ID(key) = Math.Abs(key.GetHashCode())

  • is not guaranteed to be unique
  • (not reversible)

1.1 "Re-hashing" the built-in GetHashCode until a unique ID is generated to prevent collisions.

  • existing IDs may change if a colliding key is added earlier in the input data set

2. a perfect hashing function

  • I am not sure if this can generate constant IDs if the set of inputs changes
  • (not reversible)

3. translate to base 36/64/??

  • does not shorten the long keys enough

What are the other options?
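
For reference, a minimal sketch of option 1 (masking the sign bit instead of calling Math.Abs avoids the overflow on int.MinValue, but the uniqueness and stability caveats above still apply):

using System;

// Illustrative only: a non-negative 31-bit value from the built-in hash code.
// string.GetHashCode is not guaranteed to be collision-free, and on newer .NET
// runtimes it is randomized per process, so it is not a stable ID source.
static int Id(string key) => key.GetHashCode() & 0x7FFFFFFF;

Console.WriteLine(Id("2.0_SomeNameThatMayRepeat_SubTypeThatWillNotRepeat"));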

+1  A: 

A Base64-encoded SHA-1 digest is 27 characters (without padding); base64(md5(...)) is 22 characters. Any smaller and you will have a non-negligible risk of collisions.

Perfect hashing functions aren't possible when the set of inputs changes.
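
A quick way to check those lengths with the BCL (the example key is taken from the comments below; trimming the '=' padding gives the 27- and 22-character figures):

using System;
using System.Security.Cryptography;
using System.Text;

string key = "2.0_SomeNameThatMayRepeat_SubTypeThatWillNotRepeat";

byte[] sha1 = SHA1.Create().ComputeHash(Encoding.UTF8.GetBytes(key));
byte[] md5  = MD5.Create().ComputeHash(Encoding.UTF8.GetBytes(key));

Console.WriteLine(Convert.ToBase64String(sha1).TrimEnd('=')); // 27 characters
Console.WriteLine(Convert.ToBase64String(md5).TrimEnd('='));  // 22 characters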

Marcelo Cantos
Still way too long. Given that I have 30,000 distinct keys, it should be possible to have 30,000 IDs = 5 ASCII bytes in the URL.
Marek
@Marek: I'm not sure it is possible... unless you can tell us more about the properties of those keys, then we have to map the integers over the entire set of possible inputs to ensure no collisions. "1 to 60 character" keys is not enough info for us to be able to reduce it by much.
Mark
@Mark: there are a few small patterns in the strings but I do not think this is relevant to the question. An example key: 2.0_SomeNameThatMayRepeat_SubTypeThatWillNotRepeat (the length more or less matches the real data)
Marek
You could simply truncate a SHA-1 or MD5 hash until it's small enough, but you're basically shooting arrows up in the air and trusting that you're a small enough target. 30,000 five-character (base64) keys have a 34% probability of colliding.
Marcelo Cantos
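
(A quick check of that 34% figure, using the standard birthday approximation with 30,000 keys drawn over the 64^5 possible five-character base64 values:)

using System;

double n = 30_000;              // distinct keys
double N = Math.Pow(64, 5);     // five base64 characters ≈ 1.07e9 values
double p = 1 - Math.Exp(-n * (n - 1) / (2 * N));
Console.WriteLine(p);           // ≈ 0.34
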
A: 

Set up a second, persistent DB and store your KEY/ID pairs there. Make sure you also have the data's date in the table so you can do some housekeeping.
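
A minimal file-backed sketch of what such a store could look like (the file name and tab-separated format are just placeholders; a real DB such as SQLite would serve the same purpose). The date column is there for the housekeeping mentioned above:

using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical persistent KEY/ID store: one "key<TAB>id<TAB>dateSeen" line per key.
class KeyIdStore
{
    private readonly string _path;
    public Dictionary<string, int> IdByKey { get; } = new Dictionary<string, int>();

    public KeyIdStore(string path)
    {
        _path = path;
        if (File.Exists(path))
            foreach (var line in File.ReadAllLines(path))
            {
                var parts = line.Split('\t');
                IdByKey[parts[0]] = int.Parse(parts[1]);
            }
    }

    // Appends a newly assigned pair together with the date it was first seen.
    public void Add(string key, int id)
    {
        IdByKey[key] = id;
        File.AppendAllText(_path, $"{key}\t{id}\t{DateTime.UtcNow:yyyy-MM-dd}{Environment.NewLine}");
    }
}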

lexu
Thanks for a practical suggestion, but this is not possible due to performance limitations - the whole system is in-memory; there is no persistent database.
Marek
I see. But as Guffa says: *You can only do that if you can keep a list of assigned IDs.* Doing that is cheapest (in dev time) if you use a DB (e.g. SQLite). 30,000 rows/day may seem big to you, but handled with the proper tools, it isn't.
lexu
@lexu: there are 30,000 keys, and each key has multiple representations. Currently we have 600,000 rows (expected to double soon). And the processing does not occur once a day but once every 20 minutes, with some strict response-time restrictions. That's why we have skipped the database altogether. A database (relational or not) would probably be able to handle this, but with much more expensive hardware.
Marek
For in-memory solutions, AVL trees are good. They have O(log N) search and O(log N) insert performance. This is a kind of table that will only grow, so AVL will do better than red-black, since AVL is better balanced. Also, given that the minimum size of a data item is 4 bytes (a 32-bit integer), an in-memory table will never grow beyond about 1G items -> at most about 30 comparisons per lookup. If each iteration takes 100 ns (DRAM access + processing), you will handle about 300K lookups/second using one CPU.
Ernelli
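
(For illustration, a minimal in-memory version of that lookup; the BCL's SortedDictionary is a red-black tree rather than an AVL tree, but it has the same O(log n) lookup and insert, and a hand-rolled AVL tree would be a drop-in replacement. The keys below are made up.)

using System;
using System.Collections.Generic;

// Ordered in-memory map from natural key to assigned ID.
var idByKey = new SortedDictionary<string, int>(StringComparer.Ordinal);
idByKey["2.0_SomeName_SubTypeA"] = 17;
idByKey["2.0_SomeName_SubTypeB"] = 42;

if (idByKey.TryGetValue("2.0_SomeName_SubTypeA", out int id))
    Console.WriteLine(id); // O(log n) lookup
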
+1  A: 

You can only do that if you can keep a list of assigned IDs.

For any given algorithm that actually gives you a unique ID for the current set, a new value is not guaranteed to get a unique ID.

The strings contain about 400 bits of information, so to get an integer that is guaranteed to be unique it would have to contain all the information from the string and thus be about 400 bits. That's about 120 digits when expressed as a decimal number, so it is not shorter than what you have now.
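
The arithmetic behind those figures, assuming up to 60 printable-ASCII characters per key:

using System;

double bitsPerChar = Math.Log(95, 2);     // ~6.6 bits for each of 95 printable ASCII characters
double bits = 60 * bitsPerChar;           // ~394 bits of information in a 60-character key
double digits = bits / Math.Log(10, 2);   // ~119 decimal digits to write that out as an integer
Console.WriteLine($"{bits:F0} bits ≈ {digits:F0} decimal digits");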

Guffa
Thanks, that makes sense - and this is more or less one of my premises. But I would like some practical workaround that relies on the fact that: 1) the keys in the input set are all known, and 2) I can detect and resolve duplicates resulting from new values in the set (re-generate the ID using a second (n-th) algorithm).
Marek
+1 for that clear statement in your first line.
lexu
@Marek: As you want the ID for a key to be constant regardless of the current set, the fact that you know the current set is no help. Regenerating an ID when you find a duplicate would mean that the IDs are not constant.
Guffa
I ended up keeping a lightweight persistent list of old mappings to resolve collisions.
Marek
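
For completeness, one way such a scheme could be put together: a stable hash gives each key its ID, and a small persisted list of previously assigned mappings keeps existing IDs constant across rebuilds, with new colliding keys probed to a free value. This is only a sketch under those assumptions; the names and file format are hypothetical.

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Text;

static class StableIds
{
    // Stable non-negative 31-bit hash (unlike string.GetHashCode, SHA-1 does not
    // change between runtimes or processes).
    static int StableHash(string key)
    {
        using var sha1 = SHA1.Create();
        var bytes = sha1.ComputeHash(Encoding.UTF8.GetBytes(key));
        return BitConverter.ToInt32(bytes, 0) & 0x7FFFFFFF;
    }

    // mappingFile: persisted "key<TAB>id" lines from earlier runs.
    public static Dictionary<string, int> Assign(IEnumerable<string> keys, string mappingFile)
    {
        var assigned = new Dictionary<string, int>();
        if (File.Exists(mappingFile))
            foreach (var line in File.ReadAllLines(mappingFile))
            {
                var parts = line.Split('\t');
                assigned[parts[0]] = int.Parse(parts[1]);
            }

        var used = new HashSet<int>(assigned.Values);
        var added = new List<string>();

        foreach (var key in keys)
        {
            if (assigned.ContainsKey(key))
                continue;                   // previously seen key: its ID never changes

            int id = StableHash(key);
            while (!used.Add(id))           // collision with an already-used ID:
                id = (id + 1) & 0x7FFFFFFF; // probe linearly to the next free value

            assigned[key] = id;
            added.Add($"{key}\t{id}");
        }

        if (added.Count > 0)
            File.AppendAllLines(mappingFile, added);
        return assigned;                    // mappings for all keys seen so far
    }
}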