views:

202

answers:

6

I've never built an algorithm for matching before and don't really know where to start. So here is my basic set up and why I'm doing it. Feel free to correct me if I'm not asking the right questions.

I have a database of names and unique identifiers for people. Several generated identifiers (internally generated and some third party), last name, first name, and birth date are the primary ones that I would be using.

Several times throughout the year I receive a list from a third party that needs to be imported and tied to the existing people in my database but the data is never as clean as mine. IDs could change, birth dates could have typos, names could have typos, last names could change, etc.

Each import could have 20,000 records, so even if matching is 99% accurate, that's still 200 records I'd have to go in and match manually. I think I'm looking for more like 99.9% accuracy when it comes to matching the incoming people to my users.

So, how do I go about making an algorithm that can figure this out?

PS Even if you don't have an exact answer, pointers to some materials to reference would also be helpful.

PPS Some examples would be similar to what m3rLinEz wrote:

ID: 9876234 Fname: Jose     LName: Guitierrez       Birthdate:01/20/84  '- Original'

ID: 9876234 Fname: Jose     LName: Guitierrez       Birthdate:10/20/84  '- Typo in birth date'
ID: 0876234 Fname: Jose     LName: Guitierrez       Birthdate:01/20/84  '- Wrong ID'
ID: 9876234 Fname: Jose     LName: Guitierrez-Brown Birthdate:01/20/84  '- Hyphenated last name'
ID: 9876234 Fname: Jose, A. LName: Guitierrez       Birthdate:01/20/84  '- Added middle initial'
ID: 3453555 Fname: Joseph   LName: Guitierrez       Birthdate:01/20/84  '- Probably someone else with same birthdate and same last name'
A: 

Regular expressions are what you need, why reinvent the wheel?

Greg
Now you have 2 problems.
Robert
I disagree that regular expressions are an automatic problem, but I do agree that regular expressions are not the answer in this case.
Aaron
+7  A: 

You might be interested in Levenshtein distance.

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.

You could compare each of your fields and compute the total distance. Then, by trial and error, you may discover the appropriate threshold that allows records to be interpreted as a match. Haven't implemented this myself, just thought of the idea :}

For example:

  • Record A - ID: 4831213321, Name: Jane
  • Record B - ID: 431213321, Name: Jann
  • Record C - ID: 4831211021, Name: John

The distance between A and B will be lower than between A and C or B and C, which indicates a better match.
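To make the idea concrete, here is a minimal sketch in Python: a standard dynamic-programming Levenshtein implementation, applied to the example records above by summing the per-field distances (the record values are the ones from the bullet list; the function names are just for illustration).

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

records = {
    "A": ("4831213321", "Jane"),
    "B": ("431213321",  "Jann"),
    "C": ("4831211021", "John"),
}

def total_distance(r1, r2):
    """Sum the Levenshtein distances over all corresponding fields."""
    return sum(levenshtein(f1, f2) for f1, f2 in zip(r1, r2))

print(total_distance(records["A"], records["B"]))  # 2 (one digit dropped, one letter changed)
print(total_distance(records["A"], records["C"]))  # 5
```

A record pair whose total distance falls below your tuned threshold would be treated as a match; everything else goes to the manual-review pile.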

m3rLinEz
A: 

If you're dealing with data sets of this size and different resources being imported, you may want to look into an Identity Management solution. I'm mostly familiar with Sun Identity Manager, but it may be overkill for what you're trying to do. It might be worth looking into.

Jon
A: 

If the data you are getting from 3rd parties is consistent (same format each time), I'd probably create a table for each of the 3rd parties you are getting data from, then import each new set of data into its matching table each time. You can then join the two tables on their common columns using a SQL statement; that way you can run queries that pull from multiple tables but make the result look like it came from one single unified table. Similarly, records that were added but don't have matches in both tables can be found and then manually paired. This way you keep your 'clean' data separate from the junk you get from third parties. If you wanted a true import, you could then use that joined result to create a third table containing all your data.
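A small sketch of that staging-table idea, using Python's built-in sqlite3 module; the table and column names here are made up for illustration. Imported rows land in a staging table, exact matches are joined out, and the leftovers surface for manual review:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people  (person_id TEXT, lname TEXT, fname TEXT, birthdate TEXT);
    CREATE TABLE staging (person_id TEXT, lname TEXT, fname TEXT, birthdate TEXT);
    INSERT INTO people  VALUES ('9876234', 'Guitierrez', 'Jose', '01/20/84');
    INSERT INTO staging VALUES ('9876234', 'Guitierrez', 'Jose', '01/20/84'),
                               ('0876234', 'Guitierrez', 'Jose', '01/20/84');
""")

# Exact matches on the common columns look like one unified table:
matched = conn.execute("""
    SELECT s.* FROM staging s
    JOIN people p ON p.person_id = s.person_id
                 AND p.lname = s.lname
                 AND p.birthdate = s.birthdate
""").fetchall()

# Staging rows with no exact match go to the manual-review pile:
unmatched = conn.execute("""
    SELECT s.* FROM staging s
    LEFT JOIN people p ON p.person_id = s.person_id
                      AND p.lname = s.lname
                      AND p.birthdate = s.birthdate
    WHERE p.person_id IS NULL
""").fetchall()

print(len(matched), len(unmatched))  # 1 1
```

The 'clean' people table is never touched by the import; only rows that survive the join get promoted out of staging.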

mjh2007
Unfortunately it's not consistent year to year. It's state/government data and it seems they change their format every year.
Mikecancook
Well you could use a different table for each year the data comes in, but that would get annoying quick.
mjh2007
Are you having a problem doing the matching for all records or are you just looking for a way to match the non-perfect matches?
mjh2007
I'm updating an import tool that was developed 5 years ago. Currently it's a case statement that tries to make exact matches, and when it can't, the records have to be matched manually. Which is time-consuming when there are several hundred that have to be looked at.
Mikecancook
A: 

I would start with the easy near 100% certain matches and handle them first, so now you have a list of say 200 that need fixing.

For the remaining rows you can use a simplified version of Bayes' Theorem.

For each unmatched row, calculate the likelihood that it is a match for each row in your data set, assuming that the data contains certain changes which occur with certain probabilities. For example, a person changes their surname with probability 0.1% (possibly also depending on gender), changes their first name with probability 0.01%, and has a single typo with probability 0.2% (use the Levenshtein distance to count the number of typos). Other fields also change with certain probabilities. For each row, calculate the likelihood that the row matches, considering all the fields that have changed. Then pick the one that has the highest probability of being a match.

For example, a row with only a small typo in one field but equal on all others would have a 0.2% chance of being a match, whereas a row which differs in many fields might have only a 0.0000001% chance. So you pick the row with the small typo.
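The scoring described above might be sketched like this in Python. The change probabilities are the illustrative figures from this answer, not tuned values, and the field treatment (take the likelier of "legitimate change" vs. "typos") is one simple way to combine them, not the only one:

```python
P_SURNAME_CHANGE = 0.001   # 0.1%  chance a surname legitimately changed
P_FNAME_CHANGE   = 0.0001  # 0.01% chance a first name legitimately changed
P_TYPO           = 0.002   # 0.2%  chance per single-character edit

def levenshtein(a, b):
    """Standard dynamic-programming edit distance, used to count typos."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def match_likelihood(incoming, candidate):
    """Likelihood that candidate is the same person as incoming,
    assuming the fields change independently."""
    p = 1.0
    if incoming["lname"] != candidate["lname"]:
        typos = levenshtein(incoming["lname"], candidate["lname"])
        p *= max(P_SURNAME_CHANGE, P_TYPO ** typos)  # name change vs. typos
    if incoming["fname"] != candidate["fname"]:
        typos = levenshtein(incoming["fname"], candidate["fname"])
        p *= max(P_FNAME_CHANGE, P_TYPO ** typos)
    return p

incoming  = {"fname": "Jose", "lname": "Guitierez"}   # one typo in surname
same      = {"fname": "Jose", "lname": "Guitierrez"}
different = {"fname": "John", "lname": "Smith"}

# The record differing by one typo scores far higher than an unrelated one.
print(match_likelihood(incoming, same) > match_likelihood(incoming, different))  # True
```

In practice you would compute this likelihood for every candidate row, accept the best one when it clears some probability threshold, and send the rest to manual review.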

Mark Byers
+1  A: 

When it comes to something like this, do not reinvent the wheel. The Levenshtein distance is probably your best bet if you HAVE to do this yourself, but otherwise, do some research on existing solutions that do database querying and fuzzy searching. They've been doing it longer than you; it'll probably be better, too.

Good luck!

Trevoke