views:

162

answers:

4

I am working on address book synchronization algorithm. I would like to reuse some code if there exists, but couldn't find one yet.

Does someone know about an algorithm that will tell me in numbers/float/procent how much two names are identical. Levenstein distance is not good in this approach, as names and our adddress books are matching the begining of each of the name sections.

John Smith should match
Smith Jon, Jonathan Smith, Johnny Smith

+1  A: 

To really get those kinds of cases you may need an aliases table, but I think Soundex will get you close.

http://commons.apache.org/codec/apidocs/org/apache/commons/codec/language/Soundex.html

dj_segfault
+2  A: 

You should be looking at string comparison algorithms such as Levenshtein or Smith-Waterman. Here is a great library to get you started

Mikos
+1  A: 

For names, I came up with an algorithm similar to metaphone.

You also need some logic to break up the string into surname, given names, title etc. It can get complicated.

There are edge cases. If someone has the title "Professor" you don't want that interpreted as a first name. And if they have "Lord" at the start that could either be their first name (plenty of people are called Lord) or their title. And so on. It's best if you have their name already in a standard form where you know what is their surname, given names and title.

I've written some PHP code to do this: see name (see similarityto() function), textfuzzy, probability.

thomasrutter
+1  A: 

Have a look at the Jaro Winkler algorithm too. It is good for names. http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

If you have first name, last name issues then you could just sort them to make sure Smith John is saved as John Smith

Andrew White
I am choosing this answer, as you directly pointed to the algorithm although another response have earlier submitted with the same website linkage.
Pentium10