I have two tables that I would like to compare for duplicates. These tables are just basic company information fields like name, city, state, etc. The only possibly common field that I can see would be the name column but the names are not quite exact. Is there a way that I can run a comparison between the two using a LIKE statement? I'm also open to any additional suggestions that anyone may have.

Thanks.

+1  A: 

SOUNDEX() will help you to a certain extent. But it is far from perfect.

soundex(string1) is expected to be equal to soundex(string2) when string1 and string2 sound alike, even if they are spelled differently. But as I said, it is far from perfect.
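To make the idea concrete, here is a minimal Python sketch of the common four-character American Soundex code. It is an illustration only: MySQL's SOUNDEX() uses a slightly different variant and can return longer strings, so the codes below may not match what the database produces.

```python
# Consonant groups used by American Soundex; vowels, H, W and Y get no digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return a 4-character Soundex code, or "" if name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = [letters[0]]               # the first letter is kept as-is
    prev = _CODES.get(letters[0], "")   # its digit still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            result.append(code)
        prev = code                     # a vowel resets prev to ""
    return ("".join(result) + "000")[:4]
```

With this sketch, soundex("Smith") and soundex("Smyth") give the same code, which is the behaviour described above. Names that differ in their first letter never match, which is one reason the approach is far from perfect.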

As far as I know, there is no existing algorithm which does this perfectly.

Alterlife
Thanks for the input, Alterlife. It looks like it isn't going to be possible to do this but it was worth a try. I am just trying to manage several marketing lists and now they are kind of getting out of control.
+3  A: 

I would try matching using a Double Metaphone algorithm, which is a more sophisticated SOUNDEX-type algorithm.

Here is a MySQL implementation.

RedFilter
Hi OrbMan, thanks for that link. I'll go there now.
That is indeed interesting. +1 thanks for the link.
Alterlife
Double Metaphone is a sound approach. +1 for the MySQL implementation
APC
Well OrbMan, Alterlife and APC agree that this is probably the best way to go. I had a quick read of the link and will give this a try now. Thanks.
+1  A: 

There are companies who make a good living by selling Data Cleansing products which undertake this kind of fuzzy matching. So it seems improbable that you could solve this with a simple (or even an extremely complicated) LIKE statement.

What you need is something which can compare two strings and return a score for similarity, a score of 100% meaning identical. Something like the Jaro-Winkler algorithm. Alternative algorithms include Metaphone (or Double Metaphone) and Soundex(). Soundex() is the crudest solution.
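As a rough illustration of such a similarity score, the following Python sketch implements the usual textbook form of Jaro-Winkler. The prefix weight of 0.1 and the 4-character prefix limit are the conventional defaults, assumed here rather than taken from any particular product. The result is a value between 0.0 and 1.0, where 1.0 means identical.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    match1 = [False] * len1
    match2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)
```

The classic test pair "MARTHA" and "MARHTA" scores about 0.961 with this sketch. Multiply by 100 if you prefer to think in percentages.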

An alternative solution would be to use a specialist text index. The cool thing about this approach is that we can supply a thesaurus of synonyms which irons out irrelevant differences (INC = INCORPORATED, CO = COMPANY, etc).
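The synonym idea can also be tried outside a text index. The Python sketch below is not how a database thesaurus works; it just expands abbreviations token by token before comparing names. The SYNONYMS table and the expand_synonyms helper are hypothetical and hold only the two pairs mentioned above.

```python
# Hypothetical synonym table; extend it with whatever your lists contain.
SYNONYMS = {
    "INC": "INCORPORATED",
    "CO": "COMPANY",
}


def expand_synonyms(name):
    """Upper-case a name, drop periods and commas, and expand known tokens."""
    cleaned = name.upper().replace(".", " ").replace(",", " ")
    return " ".join(SYNONYMS.get(token, token) for token in cleaned.split())
```

After this step "Acme Co., Inc." and "ACME COMPANY INCORPORATED" compare as equal strings.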

Oracle and SQL Server include such a tool but I'm not familiar with MySQL.

APC
This Jaro-Winkler distance sounds like it will be better than the general Levenshtein distance for names. +1.
j_random_hacker
Yes, using a scoring system for matching sounds like the best way to go.
A: 

Well, there's no 100% guaranteed-correct way, no. But you can probably make some progress by transforming all "messy" columns into a more canonical form, e.g. by capitalising everything, trimming leading and trailing spaces, and collapsing runs of spaces into a single space. It also helps to change names of the form "SMITH, JOHN" to "JOHN SMITH" (or vice versa; just pick a form and go with it). And of course you should work on copies of the records and leave the originals unchanged. You can experiment with discarding further information (e.g. "JOHN SMITH" -> "J SMITH"); you'll find this changes the balance of false positives to false negatives.
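A minimal Python sketch of that canonicalisation step might look like this. The canonicalise name is made up for illustration, and the "LAST, FIRST" swap assumes person-style names with exactly one comma; for company names such as "Acme, Inc." you would probably want to skip that rule.

```python
import re


def canonicalise(name):
    """Return an upper-cased, whitespace-normalised copy of name."""
    # Trim the ends and collapse any run of whitespace to a single space.
    cleaned = re.sub(r"\s+", " ", name.strip()).upper()
    # Turn "SMITH, JOHN" into "JOHN SMITH" when there is exactly one comma.
    if cleaned.count(",") == 1:
        last, first = (part.strip() for part in cleaned.split(","))
        if last and first:
            cleaned = f"{first} {last}"
    return cleaned
```

Because the function returns a new string, the original values are never modified; store the result in a separate column or table.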

I would probably take the approach of assigning a similarity score to each pair of records. E.g. if the canonicalised names, addresses and email addresses agree exactly, assign a score of 1000; otherwise, subtract (some multiple of) the Levenshtein distance from 1000 and use that. You'll need to come up with your own scoring scheme by playing around and deciding the relative importance of different types of differences (e.g. a different digit in a phone number is probably more important than a 1-character difference in two people's names). You will then experimentally be able to establish a score above which you can confidently assign a status of "duplicate" to a pair of records, and a lower score above which manual checking is required; below that score, we can confidently say that the 2 records are not duplicates.
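Here is one way the scoring idea could be sketched in Python. The penalty of 50 points per edit and the two thresholds (900 and 700) are placeholder assumptions; as described above, you would tune them by experiment on your own data.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity_score(a, b, penalty=50):
    """1000 for identical strings, minus a penalty per edit (never below 0)."""
    return max(0, 1000 - penalty * levenshtein(a, b))


def classify(score, duplicate_at=900, review_at=700):
    """Bucket a score into 'duplicate', 'review' or 'distinct'."""
    if score >= duplicate_at:
        return "duplicate"
    if score >= review_at:
        return "review"
    return "distinct"
```

To weight fields differently, compute a score per field (name, phone, email) with its own penalty and combine them, for example by taking the minimum or a weighted sum.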

The realistic goal here is to reduce the amount of manual duplicate-removal work you'll need to do. You are unlikely to be able to eliminate it entirely, unless all the duplicates were generated through some automatic copying process.

j_random_hacker
Well, I think what I will probably wind up doing in the meantime is matching on the exact names. This should weed out at least 500 or so companies between the two lists. Thanks for the great explanation, j_random_hacker.
A: 

If you wish to consider an external tool, columbo may help you. If the names are not quite exact, the only thing I can think of is what Alterlife said: SOUNDEX(). It's not guaranteed to work 100%, but it could get quite close.

Itamar