views:

1483

answers:

9

I have a simple contacts database, but I'm having problems with users entering duplicate data. I have implemented a simple comparison of the data, but unfortunately the duplicated entries are not exactly the same. For example, names are misspelled, or one person will enter 'Bill Smith' and another 'William Smith' for the same person.

So is there some sort of algorithm that can give a percentage for how similar an entry is to another?

+3  A: 

While I do not have an algorithm for you, my first action would be to take a look at the process involved in entering a new contact. Perhaps users do not have an easy way to find the contact they are looking for. Much like on Stack Overflow's new question form, you could suggest contacts that already exist on the new contact screen.

Kevin Lamb
+2  A: 

I imagine that this problem is well understood but what occurs to me on first reading is:

  • compare fields individually
  • count those that match (for a possibly loose definition of 'match', possibly weighting the fields differently)
  • present for human intervention any cases which pass some threshold

Use your existing database to get a good first guess for the threshold, and correct as you accumulate experience.

You may prefer a fairly strong bias toward false positives, at least at first.
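A rough sketch of this idea in Python. The field names, weights, and the 0.8 threshold are all illustrative assumptions, and difflib's ratio stands in for a "loose" per-field match:

```python
from difflib import SequenceMatcher

# Hypothetical field weights -- tune these against your own data.
FIELD_WEIGHTS = {"first_name": 2.0, "surname": 3.0, "phone": 4.0, "email": 4.0}

def field_similarity(a, b):
    """Loose match score in [0, 1] for a single field value."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_score(rec_a, rec_b):
    """Weighted similarity between two contact records (dicts),
    computed only over fields present in both records."""
    used = [(f, w) for f, w in FIELD_WEIGHTS.items()
            if rec_a.get(f) and rec_b.get(f)]
    if not used:
        return 0.0
    total = sum(w for _, w in used)
    return sum(w * field_similarity(rec_a[f], rec_b[f])
               for f, w in used) / total

a = {"first_name": "Bill", "surname": "Smith", "phone": "555-0199"}
b = {"first_name": "Bil", "surname": "Smith", "phone": "555-0199"}
if record_score(a, b) > 0.8:  # threshold: present to a human for review
    print("possible duplicate")
```

Anything above the threshold goes to a person for the final call, which matches the "human intervention" step above.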

dmckee
+2  A: 

You can compare the names with the Levenshtein distance. If the names are identical, the distance is 0; otherwise it is the minimum number of single-character insertions, deletions, and substitutions needed to transform one string into the other.
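A sketch of the standard dynamic-programming implementation in Python:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Bill Smith", "Bill Smyth"))  # 1
print(levenshtein("kitten", "sitting"))         # 3
```

Dividing the distance by the length of the longer string gives the percentage-style score the question asks about.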

Peter Hoffmann
How would this detect that Bill = William? His problem is duplicates, not spelling mistakes in otherwise exact duplicates.
Osama ALASSIRY
+2  A: 

This may or may not be related, but minor misspellings can be detected by a Soundex search; for example, it will let you treat Britney Spears, Britanny Spares, and Britny Spears as duplicates.
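As a sketch, here is a plain-Python implementation of classic American Soundex (first letter plus three digits; this follows the textbook rule set rather than any particular library):

```python
def soundex(name: str) -> str:
    """Classic American Soundex code: first letter + three digits."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    result = name[0]
    prev = codes.get(name[0], "")
    for c in name[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            result += code
        if c not in "HW":  # H and W do not reset the previous code
            prev = code
    return (result + "000")[:4]

print(soundex("Britney"), soundex("Britny"))  # B635 B635
print(soundex("Spears"), soundex("Spares"))   # S162 S162
```

Note that it only catches sound-alike spellings: soundex("Bill") and soundex("William") still differ, which is the nickname problem discussed below.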

Nickname contractions, however, are difficult to treat as duplicates, and I doubt it would be wise: there are bound to be distinct people named Bill Smith and William Smith, and you would have to repeat the exercise for Charles -> Chuck, Robert -> Bob, and so on.

Also, if you are considering, say, Muslim users, the problem becomes harder still (a great many people, for example, are named Mohammed/Mohammad).

Jon Limjap
Entering Mohammed/Mohammad/Mohd is not an issue unless you're transliterating from Arabic; most people have a preferred spelling and always write it that way. Searching for an Arabic name is a different issue, and a hard one.
Osama ALASSIRY
+3  A: 

If you have access to SSIS, check out the Fuzzy Grouping and Fuzzy Lookup transformations.

http://www.sqlteam.com/article/using-fuzzy-lookup-transformations-in-sql-server-integration-services

http://msdn.microsoft.com/en-us/library/ms137786.aspx

jms
A: 

I'm not sure it will work well for the names-vs-nicknames problem, but the most common algorithm in this area is the edit distance / Levenshtein distance algorithm: basically a count of the character changes, additions, and removals required to turn one string into another.
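If what you want is the percentage the question asks for, Python's standard-library difflib gives a quick edit-based similarity ratio without any hand-rolled code; a sketch:

```python
from difflib import SequenceMatcher

def similarity_percent(a: str, b: str) -> float:
    """Similarity as a percentage, based on matching character runs
    (difflib's ratio, which is closely related to edit distance)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(round(similarity_percent("Bill Smith", "Bill Smyth")))     # 90
print(round(similarity_percent("Bill Smith", "William Smith")))
```

As the comment above notes, though, no character-level score will ever rate "Bill" and "William" as the same person.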

For names, I'm not sure you're ever going to get good results with a purely algorithmic approach; what you really need is masses of data. Consider how much better Google's spelling suggestions are than those in a typical desktop application: Google can process billions of web queries and observe which queries lead to each other, which 'did you mean' links actually get clicked, and so on.

There are a few companies that specialise in the name-matching problem (mostly for national-security and fraud applications). The one I could remember, Search Software America, seems to have been bought out by these guys http://www.informatica.com/products_services/identity_resolution/Pages/index.aspx, but I suspect any of these solutions would be far too expensive for a contacts application.

Matt Sheppard
+2  A: 

If you have a large database with string fields, you can very quickly find a lot of duplicates by using the simhash algorithm.
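A toy sketch of the idea: simhash produces a fingerprint in which similar inputs get similar bit patterns, so near-duplicates can be found by comparing Hamming distances instead of comparing every pair of strings character by character. (This version hashes character trigrams with MD5; a production simhash would use richer features.)

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """64-bit simhash over character trigrams of the input string."""
    text = text.lower()
    grams = [text[i:i + 3] for i in range(max(1, len(text) - 2))]
    v = [0] * bits  # per-bit vote counter
    for g in grams:
        h = int(hashlib.md5(g.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Near-duplicate strings yield hashes with a small Hamming distance.
d_close = hamming(simhash("Bill Smith, 12 Main St"),
                  simhash("Bill Smith, 12 Main St."))
d_far = hamming(simhash("Bill Smith, 12 Main St"),
                simhash("Xavier Quinn, 99 Oak Ave"))
```

In a large database you would index the fingerprints and only do expensive field-by-field comparison on rows whose hashes are within a few bits of each other.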

Tyler
+3  A: 

So is there some sort of algorithm that can give a percentage for how similar an entry is to another?

Algorithms such as Soundex and edit distance (as suggested in a previous post) can solve some of your problems. However, if you are serious about cleaning your data, they will not be enough. As others have stated, "Bill" does not sound anything like "William".

The best solution I have found is to use a reduction algorithm and a mapping table to reduce each name to its root name.

To your regular Address table, add root versions of the names, e.g. Person (FirstName, RootFirstName, Surname, RootSurname, ...)

Now, create a mapping table: FirstNameMappings (FirstName PRIMARY KEY, RootName)

Populate your mapping table with something like: INSERT IGNORE INTO FirstNameMappings (FirstName, RootName) SELECT DISTINCT FirstName, 'UNDEFINED' FROM Person;

This adds every first name in your Person table, each with a RootName of 'UNDEFINED'.

Now, sadly, you will have to go through all the unique first names and map each to a root name; for example, "Bill", "Billl" and "Will" should all be translated to "William". This is very time-consuming, but if data quality really matters to you, I think it's one of the best ways.

Now use the newly created mapping table to update the RootFirstName field in your Person table. Repeat for surname and address. Once this is done, you should be able to detect duplicates without being tripped up by spelling errors.
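The same lookup can be sketched outside SQL; the mapping entries below are illustrative examples of the kind of table you would build up by hand:

```python
# Hand-built mapping from observed first names to a canonical root name.
FIRST_NAME_ROOTS = {
    "bill": "william", "billl": "william", "will": "william",
    "bob": "robert", "chuck": "charles",
}

def root_first_name(name: str) -> str:
    key = name.strip().lower()
    # Names not yet mapped fall back to themselves -- the answer's
    # "UNDEFINED" stage, awaiting manual review.
    return FIRST_NAME_ROOTS.get(key, key)

def same_person(a, b):
    """Compare contacts on root names rather than raw spellings."""
    return (root_first_name(a["first"]) == root_first_name(b["first"])
            and a["surname"].lower() == b["surname"].lower())

a = {"first": "Bill", "surname": "Smith"}
b = {"first": "William", "surname": "Smith"}
print(same_person(a, b))  # True
```

The comparison is then an exact match on the root columns, which is cheap and indexable; all the hard work lives in maintaining the mapping table.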

Tnilsson
A: 

You might also want to look into probabilistic matching.