ansaurus

Question

How do I go about building a matching algorithm?

Answer 1

A:

Regular expressions are what you need, why reinvent the wheel?

Greg 2010-01-29 17:47:42

Now you have 2 problems.

Robert 2010-01-29 17:51:15

I disagree that regular expressions are an automatic problem, but I do agree that regular expressions are not the answer in this case.

Aaron 2010-01-29 17:53:27

Answer 2

+7 A:

You might be interested in Levenshtein distance.

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.

You could compare each of your fields and compute the total distance. By trial and error you may then discover an appropriate threshold below which records can be interpreted as matched. I have not implemented this myself, but just thought of the idea :}

For example:

Record A - ID: 4831213321, Name: Jane
Record B - ID: 431213321, Name: Jann
Record C - ID: 4831211021, Name: John

The distance between A and B will be lower than the distance between A and C or between B and C, which indicates a better match.
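A minimal sketch of this idea in Python, assuming each record is a plain dict of string fields. The helper names `levenshtein` and `record_distance` are made up for illustration, and the sample data is the three records above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


def record_distance(r1: dict, r2: dict) -> int:
    """Total distance: sum of per-field distances over the fields both records share."""
    return sum(levenshtein(r1[k], r2[k]) for k in r1.keys() & r2.keys())


records = {
    "A": {"id": "4831213321", "name": "Jane"},
    "B": {"id": "431213321", "name": "Jann"},
    "C": {"id": "4831211021", "name": "John"},
}

for x, y in [("A", "B"), ("A", "C"), ("B", "C")]:
    print(x, y, record_distance(records[x], records[y]))
# A B 2
# A C 5
# B C 5
```

A threshold somewhere between 2 and 5 would accept A/B as a match here, but the right value for real data still has to be found by experiment.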

m3rLinEz 2010-01-29 17:55:26

Answer 3

A:

If you're dealing with data sets of this size and different resources being imported, you may want to look into an Identity Management solution. I'm mostly familiar with Sun Identity Manager, but it may be overkill for what you're trying to do. It might be worth looking into.

Jon 2010-01-29 17:57:13

Answer 4

A:

If the data you are getting from 3rd parties is consistent (same format each time), I'd probably create a table for each of the 3rd parties you are getting data from, then import each new set of data into the same table each time. You can then join the tables on their common columns with an SQL JOIN. That way you can run SQL queries that pull data from multiple tables but make it look like it came from one single unified table. Similarly, records that don't have matches in both tables can be found and then manually paired. This way you keep your 'clean' data separate from the junk you get from third parties. If you wanted a true import, you could then use that joined result to create a third table containing all your data.
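As a rough illustration of the join approach, here is a sketch using Python's built-in sqlite3 module with an in-memory database. The table names, column names, and rows are all hypothetical; real imports would have their own schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: one table of clean data, one per third-party source.
    CREATE TABLE clean_records (person_id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE state_import (person_id TEXT, name TEXT, score INTEGER);
    INSERT INTO clean_records VALUES ('4831213321', 'Jane'), ('4831211021', 'John');
    INSERT INTO state_import VALUES ('4831213321', 'Jane', 87), ('431213321', 'Jann', 91);
""")

# Rows that match exactly on the common column look like one unified table.
matched = conn.execute("""
    SELECT c.person_id, c.name, s.score
    FROM clean_records AS c
    JOIN state_import AS s ON s.person_id = c.person_id
""").fetchall()

# Imported rows with no exact match are left over for manual (or fuzzy) pairing.
unmatched = conn.execute("""
    SELECT s.person_id, s.name
    FROM state_import AS s
    LEFT JOIN clean_records AS c ON c.person_id = s.person_id
    WHERE c.person_id IS NULL
""").fetchall()

print(matched)    # [('4831213321', 'Jane', 87)]
print(unmatched)  # [('431213321', 'Jann')]
```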

mjh2007 2010-01-29 17:59:24

Unfortunately it's not consistent year to year. It's state/government data and it seems they change their format every year.

Mikecancook 2010-01-29 18:15:28

Well you could use a different table for each year the data comes in, but that would get annoying quick.

mjh2007 2010-01-29 18:18:36

Are you having a problem doing the matching for all records or are you just looking for a way to match the non-perfect matches?

mjh2007 2010-01-29 18:21:20

I'm updating an import tool that was developed 5 years ago. Currently it's a case statement that tries to make exact matches, but when it can't, the records have to be matched manually, which is time consuming when there are several hundred to look at.

Mikecancook 2010-01-29 19:24:08

Answer 5

A:

I would start with the easy, near 100% certain matches and handle them first, so that you are left with a list of, say, 200 that need fixing.

For the remaining rows you can use a simplified version of Bayes' Theorem.

For each unmatched row, calculate the likelihood that it is a match for each row in your data set, assuming that the data contains certain changes which occur with certain probabilities. For example, a person changes their surname with probability 0.1% (possibly also depending on gender), changes their first name with probability 0.01%, and has a single typo with probability 0.2% (use Levenshtein distance to count the number of typos). Other fields also change with certain probabilities. For each row, calculate the likelihood that the row matches, considering all the fields that have changed. Then pick the one that has the highest probability of being a match.

For example, a row with only a small typo in one field but equal on all others would have a 0.2% chance of a match, whereas a row which differs in many fields might have only a 0.0000001% chance. So you pick the row with the small typo.
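One way to sketch this scoring in Python is shown below. It is only an illustration under several assumptions: fields are treated as independent, an unchanged field contributes a factor of 1, a differing field is explained by whichever is more probable of "typos" or "genuine change", and the result is an unnormalised likelihood rather than a true posterior. The probabilities are the example figures above; `DEFAULT_CHANGE_P`, the field names, and the function names are invented for the example.

```python
from math import prod

TYPO_P = 0.002                                       # 0.2% per typo
CHANGE_P = {"first_name": 0.0001, "surname": 0.001}  # 0.01% and 0.1%
DEFAULT_CHANGE_P = 0.001                             # assumption for any other field


def levenshtein(a: str, b: str) -> int:
    """Edit distance, used here as the number of typos."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def field_likelihood(field: str, old: str, new: str) -> float:
    """Likelihood of seeing `new` if the true value is `old`."""
    if old == new:
        return 1.0
    typo_explanation = TYPO_P ** levenshtein(old, new)
    change_explanation = CHANGE_P.get(field, DEFAULT_CHANGE_P)
    return max(typo_explanation, change_explanation)


def match_likelihood(incoming: dict, candidate: dict) -> float:
    """Product of per-field likelihoods (fields assumed independent)."""
    return prod(field_likelihood(f, candidate[f], incoming[f]) for f in incoming)


def best_match(incoming: dict, candidates: list) -> dict:
    """Candidate row with the highest likelihood of being the same person."""
    return max(candidates, key=lambda c: match_likelihood(incoming, c))


incoming = {"id": "4831213321", "first_name": "Jane", "surname": "Smyth"}
candidates = [
    {"id": "4831213321", "first_name": "Jane", "surname": "Smith"},
    {"id": "4831211021", "first_name": "John", "surname": "Smith"},
]
print(best_match(incoming, candidates))  # the first candidate: one typo in surname
```

In practice you would probably also want a minimum likelihood below which a row is still sent for manual review instead of being matched automatically.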

Mark Byers 2010-01-29 18:01:42

Answer 6

+1 A:

When it comes to something like this, do not reinvent the wheel. The Levenshtein distance is probably your best bet if you HAVE to do this yourself, but otherwise, do some research on existing solutions that do database queries and fuzzy searches. They've been doing it longer than you, so it'll probably be better, too.

Good luck!

Trevoke 2010-01-29 18:15:33
