views:

300

answers:

6

My problem:

I'm looking for a way to represent a person's name and address as an encoded id. The id should contain only alpha-numeric characters, be collision-proof, and be represented in a smallest number of characters possible. My first thought was to simply use a cryptographic hash function like MD5 or SHA1, but this seems like overkill (security isn't important - doesn't need to be one-way) and I'd prefer to find something that would produce a shorter id. Does anyone know of an existing algorithm that fits this problem?

In other words, what is the best way to implement the following function so that the return value is the same consistently for the same input, collisions are unlikely, and ids are less than 20 characters?

>>> make_fake_id(fname = 'Oscar', lname = 'Grouch', stnum = '1', stname = 'Sesame', zip = '12345')
N1743123734

Application Context (for those that are interested):

This will be used for a record linkage app. Given an input name and address we search a very large database for the best match and return the database id and other data (how we do this is not important here). If there isn't a match I need to generate this psuedo/generated/derived id from the search input (entity's name and address data). Every search record should result in an output record with either a real (the actual database id resulting from a match/link) or this generated psuedo/generated/derived id. The psuedo id will be prefixed with a character (e.g. N) to differentiate it from a real id.

+4  A: 

I know you said no to MD5 and SHA1, but I think you should consider them anyway. As well as being well studied hashing algorithms, the length gives you more protection against possible collisions. No hash is collision-proof, but the cryptographic ones generally are less collision-prone than something you couuld come up with yourself.

Paul Tomblin
A: 

Well, if there's more than one person at the same address with the same name, you're toast here, (w/o adding code to detect this and add a discriminator of some kind).

but assuming that issue is not, then the street address and zip code portion of the full addresss is sufficient to guaranteee uniqueness there, so adding enough data from the name should take care of the issue...

Do you have access to a database, or other persistence mechanism, where you could generate and maintain key values for each address? Then keep the address and individual entities in two keyed dictionary structures, where the key is autogenerated for each new distinct address, person encountered... and then use the autogenerated alpha-numeric key...

You could use AAAAA01  for first person at first address,
              AAAAA02 for second person at first address,
              AAAAB07 for the seventh resident at the second adresss, etc.

If you donlt have any way to generate and maintain these entity-Key mappings then you need to use the full street address/Zip and fullNAme, or a hash value of the same, although the Hash value approach has a smnall chance of generating duplicates...

Charles Bretana
A: 

I wonder whether you intend to "assign" these ids to the users? If so, I would expect your users to hate anything that you propose; who would want a user id of "AAAAA01"?

So, if these ids are visible to the user, then you should just let them pick what they like and check them for uniqueness (easy). If they are not visible to the user (e.g., internal primary key), then just generate them sequentially using an appropriate technique such as an Oracle Sequence or SQL Server AutoNumber (also easy).

If these ids are an attempt to detect a user that is registering more than once, then I would agree that you should consider a cryptographic hash followed by a full comparison of the registration data (name, address, etc.). However, to be usable, you will need to translate the data into a canonical form (standardized letter case, whitespace, canonical street address, etc.) before computing the hash or making the comparison. Otherwise, you will mismatch based on trivial differences.

EDIT: Now that I understand the problem space better based on your edits, I think that it is highly unlikely that your algorithm (so far) will catch most matches. Beyond my suggestion to canonicalize the inputs, I recommend that you consider an approach that results in a ranked list of a handful of possible matches (to be resolved by a human if possible) rather than an all-or-nothing attempt at a single match. In other words, I recommend a search approach rather than a lookup approach.

Is that feasible in your situation?

Rob Williams
For this app we need to return the actual database id if a matching record exists. If not, we need to return a psuedo id generated from the search input. The psuedo id will be prefixed to differentiate it from a real id. Just one of those strange customer requirements.
A: 

A good solution is somewhat dependent on your application. Do you know how many users and what the set of all users is? If you provide more details you would get better help.

Tim
For a record linkage app. Given a input name and address we search a very large database for the best match and return the database id and other data. If there isn't a match I need to generate this psuedo id from the search input.
A: 

I agree with the other poster suggesting serial numbers. OTOH, if you really, really really want to do something else:

Create a SHA1 hash from the data, and store it in a table with a serial number field.

Then, when you get the data, calculate the hash, look it up on the table, get the serial, and that's your id. If it's not on the table, insert it.

+1  A: 
  • Use a cryptographic hash for its collision resistance, not its other qualities
  • Use as many bytes from the hash as you want (truncate)
  • convert to alpha-numeric characters
  • You can also truncate the alpha-numeric string instead of the hash

An easy way to do this: hash the data, encode in base64, remove all non-alpha-numeric characters, truncate.

N_HASH_CHARS = 11
import hashlib, re
def digest(name, address):
  hash = hashlib.md5(name + "|" + address).digest().encode("base64")
  alnum_hash = re.sub(r'[^a-zA-Z0-9]', "", hash)
  return alnum_hash[:N_HASH_CHARS]

How many alpha-numeric characters should you keep? Each character gives you around 5.95 bits of entropy (log(62,2)). 11 characters give you 65.5 bits of entropy, which should be enough to avoid a collision for the first 2**32.7 users (about 7 billion).

orip