I'm not a Natural Language Processing student, yet I know that comparing names is not a trivial strcmp(n1,n2).

Here's what I've learned so far:

  • comparing personal names can't be solved 100%.
  • there are ways to achieve a certain degree of accuracy.
  • the answer will be locale-specific, and that's OK.

I'm not looking for spelling alternatives! The assumption is that the input's spelling is correct.

For example, all the names below can refer to the same person:

  • Berry Tsakala
  • Bernard Tsakala
  • Berry J. Tsakala
  • Tsakala, Berry

I'm trying to:

  1. build (or copy) an algorithm which grades the relationship between two input names
  2. find an indexing method (for names in my database, for hash tables, etc.)

Note: my task isn't about finding names in text, but about comparing two names, e.g.

name_compare( "James Brown", "Brown, James", "en-US" ) ---> 99.0%
+2  A: 

Soundex is sometimes used to compare similar names. It doesn't deal with first name/last name ordering, but you could probably just have your code look for the comma to solve that problem.
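For reference, here is a minimal sketch of American Soundex in Python, together with the comma handling suggested above. The names `soundex` and `reorder` are illustrative and not taken from any particular library; a tested library implementation is probably a better choice in practice.

```python
# Digit codes for the consonant groups used by American Soundex.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word):
    """Return the 4-character Soundex code of a single word."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        # Skip uncoded letters and repeats of the previous code.
        if code and code != prev:
            result += code
        # 'h' and 'w' do not separate two letters with the same code.
        if ch not in "hw":
            prev = code
    return (result + "000")[:4]


def reorder(name):
    """Turn 'Last, First' into 'First Last'; leave other names as they are."""
    if "," in name:
        last, first = name.split(",", 1)
        return f"{first.strip()} {last.strip()}"
    return name.strip()
```

Soundex groups names by sound, so it helps with spelling variants but not with nicknames: `soundex("Bernard")` is `B656` while `soundex("Berry")` is `B600`.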

Jacob
Yes @Jacob, Soundex is the right thing, but @berry will have to find a good implementation in the language he is using.
TheVillageIdiot
That's nice, and Google finds many Soundex libraries and online converters. However, Bernard != Berry in Soundex.
Adam Matan
Wrong answer. Soundex overcomes bad spelling, not different word order. I wrote explicitly that the spelling is always correct.
Berry Tsakala
Sorry, I guess I missed the point of your post. I see now that you're not trying to correct others' misrepresentations of a name, but to match different correct representations of the same name.
Jacob
+6  A: 

I used Tanimoto Coefficient for a quick (but not super) solution, in Python:

"""
Formula:
  Na = number of set A elements
  Nb = number of set B elements
  Nc = number of common items

  T = Nc / (Na + Nb - Nc)
"""
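def tanimoto(a, b):
    # Elements of a that also occur in b. Repeated elements are counted
    # each time, so for strings this works per character, not per set.
    c = [v for v in a if v in b]
    return float(len(c)) / (len(a) + len(b) - len(c))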

def name_compare(name1, name2):
    return tanimoto(name1, name2)


>>> name_compare("James Brown", "Brown, James")
0.91666666666666663
>>> name_compare("Berry Tsakala", "Bernard Tsakala")
0.75
>>>
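
The function above compares the names character by character. As a variation (not part of the original answer), the same coefficient can be applied to sets of word tokens, which makes the score independent of word order and punctuation. This is a rough sketch; `tokens`, `tanimoto_sets` and `name_compare_tokens` are hypothetical names chosen for illustration.

```python
import re


def tokens(name):
    """Lowercase the name and split it into a set of word tokens."""
    return set(re.findall(r"\w+", name.lower()))


def tanimoto_sets(a, b):
    """Tanimoto (Jaccard) coefficient of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty inputs count as identical
    return len(a & b) / len(a | b)


def name_compare_tokens(name1, name2):
    return tanimoto_sets(tokens(name1), tokens(name2))
```

With this variant, "James Brown" and "Brown, James" score 1.0, while "Berry Tsakala" and "Bernard Tsakala" share only one of three distinct tokens, so nicknames still need separate handling.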

Edit: A link to a good and useful book.

Nick D
tanimoto is perfectly happy taking strings, no need to list-ify them
Jimmy
Oops! Jimmy you are right, thanks!
Nick D
Very interesting! Thanks. It actually gives me a meaningful numerical result. (I'm trying to compile this one (http://www.dalkescientific.com/writings/diary/archive/2008/06/27/computing_tanimoto_scores.html) for all our platforms... couldn't find a binary implementation.)
Berry Tsakala
Berry, I'm glad I could help you :)
Nick D
A: 

Analyzing name order and the existence of middle names/initials is trivial, of course, so it looks like the real challenge is knowing common name alternatives. I doubt this can be done without using some sort of nickname lookup table. This list is a good starting point. It doesn't map Bernard to Berry, but it would probably catch the most common cases. Perhaps an even more exhaustive list can be found elsewhere, but I definitely think that a locale-specific lookup table is the way to go.
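
A nickname lookup table could be sketched roughly as follows in Python. The table contents are a tiny hypothetical sample (including the Berry/Bernard pair from the question), and `NICKNAMES`, `canonical_given_name` and `same_given_name` are illustrative names; a real table would be much larger and locale-specific.

```python
# Hypothetical sample table mapping nicknames to a canonical given name.
NICKNAMES = {
    "berry": "bernard",
    "bill": "william",
    "will": "william",
    "bob": "robert",
    "jim": "james",
}


def canonical_given_name(name):
    """Map a given name to its canonical form, or return it unchanged."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def same_given_name(a, b):
    """True if both given names resolve to the same canonical form."""
    return canonical_given_name(a) == canonical_given_name(b)
```

Mapping every variant to one canonical form also gives a stable key that can be used for indexing or hashing.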

Jacob
+1  A: 

We've been doing this sort of work non-stop lately, and the approach we've taken is to use a look-up table or alias list. If you can discount misspelled, misheard, and non-English names, then the difficult part is taken away. In your examples we would assume that the first word and the last word are the forename and the surname. Anything in between (middle names, initials) would be discarded. Berry and Bernard would be in the alias list, and when Tsakala did not match Berry we would flip the word order around and then get the match.

Steve Mc
Nice, although middle names play a very important role in distinguishing common names from each other. Take, for example, Spanish names: adding a single letter or a middle name significantly narrows down the possible matches against a list of alternative names.
Berry Tsakala
I did say "if you can discount non-English names". If you're working in a cultural situation where middle names are significant, then you would obviously change the logic. The second name would effectively become part of the first name. In that situation I would try to get a match on forename, second name and surname, then try again without the second name, and perhaps rank the results accordingly.
Steve Mc