ansaurus

Question

Human name comparison: ways to approach this task

Answer 1

+2 A:

Soundex is sometimes used to compare similar names. It doesn't deal with first name/last name ordering, but you could probably just have your code look for the comma to solve that problem.
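As a rough sketch of the idea (not a reference implementation), the following Python computes an American Soundex code for each word and uses the comma to put a "Last, First" name into "First Last" order before comparing. The helper names `soundex`, `reorder` and `names_sound_alike` are made up for this illustration.

```python
# Consonant groups of American Soundex; vowels, H, W and Y get no digit.
_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
           ("L", "4"), ("MN", "5"), ("R", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(word):
    """Return the 4-character Soundex code of one word ('' if it has no letters)."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (result + "000")[:4]


def reorder(name):
    """Turn 'Last, First' into ['First', 'Last']; other names are just split."""
    if "," in name:
        last, first = name.split(",", 1)
        name = first + " " + last
    return name.split()


def names_sound_alike(name1, name2):
    """Compare two names word by word using their Soundex codes."""
    codes1 = [soundex(w) for w in reorder(name1)]
    codes2 = [soundex(w) for w in reorder(name2)]
    return codes1 == codes2
```

With this sketch, "James Brown" and "Brown, James" compare equal, while Bernard (B656) and Berry (B600) get different codes.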

Jacob 2009-06-21 07:31:26

Yes @Jacob, Soundex is the right thing, but @berry will have to find a good implementation in the language he is using.

TheVillageIdiot 2009-06-21 07:52:29

That's nice, and Google finds many Soundex libraries and online converters. However, Bernard != Barry in Soundex.

Adam Matan 2009-06-21 12:30:35

Wrong answer. Soundex overcomes bad spelling, not different ordering. I wrote explicitly that the spelling is always correct.

Berry Tsakala 2009-06-21 13:05:07

Sorry, I guess I missed the point of your post. I see now that you're not trying to correct others' misrepresentations of a name, but correct representations of a name.

Jacob 2009-06-21 16:25:41

Answer 2

+6 A:

I used the Tanimoto coefficient for a quick (but not great) solution, in Python:

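"""
Formula:
  Na = number of set A elements
  Nb = number of set B elements
  Nc = number of common items

  T = Nc / (Na + Nb - Nc)
"""
def tanimoto(a, b):
    # Elements of a that also occur in b. Repeated characters of a are
    # counted each time, so this only approximates a true set intersection.
    c = [v for v in a if v in b]
    return float(len(c)) / (len(a) + len(b) - len(c))

def name_compare(name1, name2):
    # Strings are iterated character by character.
    return tanimoto(name1, name2)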


>>> name_compare("James Brown", "Brown, James")
0.91666666666666663
>>> name_compare("Berry Tsakala", "Bernard Tsakala")
0.75
>>>
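The formula is stated for sets, whereas `tanimoto` works on the raw character sequences. For comparison, here is a hedged sketch that applies the same formula to true sets, either of characters or of whole words. The names `tanimoto_sets` and `name_tokens` are assumptions for this example and are not part of the code above.

```python
import re


def tanimoto_sets(a, b):
    """Tanimoto coefficient of two collections, each treated as a set."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty inputs count as identical
    common = len(set_a & set_b)
    return float(common) / (len(set_a) + len(set_b) - common)


def name_tokens(name):
    """Lower-cased words of a name, ignoring punctuation."""
    return re.findall(r"\w+", name.lower())


# Character sets: the comma still costs a little.
print(tanimoto_sets("James Brown", "Brown, James"))
# Word sets: order and punctuation no longer matter.
print(tanimoto_sets(name_tokens("James Brown"), name_tokens("Brown, James")))
```

On word sets the reordered name scores 1.0, while "Berry Tsakala" against "Bernard Tsakala" drops to 1/3, since only the surname is shared.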

Edit: A link to a good and useful book.

Nick D 2009-06-21 07:54:27

tanimoto is perfectly happy taking strings, no need to list-ify them

Jimmy 2009-06-21 08:05:39

Oops! Jimmy you are right, thanks!

Nick D 2009-06-21 08:08:18

Very interesting! Thanks. It actually gives me a meaningful numerical result. (I'm trying to compile this one for all our platforms: http://www.dalkescientific.com/writings/diary/archive/2008/06/27/computing_tanimoto_scores.html. I couldn't find a binary implementation.)

Berry Tsakala 2009-06-21 21:56:54

Berry, I'm glad I could help you :)

Nick D 2009-06-22 03:01:10

Answer 3

A:

Analyzing name order and the existence of middle names/initials is trivial, of course, so it looks like the real challenge is knowing common name alternatives. I doubt this can be done without using some sort of nickname lookup table. This list is a good starting point. It doesn't map Bernard to Berry, but it would probably catch the most common cases. Perhaps an even more exhaustive list can be found elsewhere, but I definitely think that a locale-specific lookup table is the way to go.

Jacob 2009-06-21 16:46:35

Answer 4

+1 A:

We've been doing this sort of work non-stop lately, and the approach we've taken is to use a look-up table or alias list. If you can discount misspelled, misheard and non-English names, then the difficult part is taken away. In your examples we would assume that the first word and the last word are the forename and the surname. Anything in between (middle names, initials) would be discarded. Berry and Bernard would be in the alias list, and when Tsakala did not match Berry we would flip the word order around and then get the match.
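The steps described above could be sketched roughly as follows. The alias table `ALIASES` and the function names are hypothetical, and the single Berry/Bernard entry is only a stand-in for a real alias list.

```python
# Hypothetical alias list: each known forename maps to a canonical form.
ALIASES = {"berry": "bernard", "bernard": "bernard"}


def canonical(forename):
    """Map a forename to its canonical alias; unknown names map to themselves."""
    forename = forename.lower()
    return ALIASES.get(forename, forename)


def first_and_last(name):
    """Keep the first and last word; middle names and initials are discarded.

    Assumes the name contains at least one word.
    """
    words = name.replace(",", " ").split()
    return words[0].lower(), words[-1].lower()


def same_person(name1, name2):
    """Match forename (via aliases) and surname, retrying with flipped order."""
    f1, s1 = first_and_last(name1)
    f2, s2 = first_and_last(name2)
    if canonical(f1) == canonical(f2) and s1 == s2:
        return True
    # No match: assume name2 was written surname-first and flip it.
    return canonical(f1) == canonical(s2) and s1 == f2
```

Here `same_person("Berry Tsakala", "Tsakala, Bernard")` fails the first comparison (Berry against Tsakala) and succeeds after the flip.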

Steve Mc 2009-06-21 18:53:52

Nice, although middle names play a very important role in distinguishing common names from each other. Take Spanish names, for example: a single extra letter or middle name significantly narrows down the possible matches against a list of alternative names.

Berry Tsakala 2009-06-21 21:50:18

I did say "if you can discount non-English names". If you're working in a cultural situation where middle names are significant, then you would obviously change the logic. The second name would effectively become part of the first name. In that situation I would try to get a match on forename, second name and surname, then try again without the second name, and perhaps rank the results accordingly.
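One way to read this two-pass idea in code, as a sketch only: compare forename, second name and surname first, fall back to forename and surname, and rank the first kind of match higher. The function names and the scores 2/1/0 are assumptions chosen for illustration.

```python
def split_name(name):
    """Split into (forename, second names, surname).

    Assumes a non-empty name written as 'First [Middle ...] Last'.
    """
    words = name.lower().split()
    return words[0], tuple(words[1:-1]), words[-1]


def match_rank(name1, name2):
    """2 = match including second names, 1 = match ignoring them, 0 = no match."""
    f1, m1, s1 = split_name(name1)
    f2, m2, s2 = split_name(name2)
    if (f1, s1) != (f2, s2):
        return 0
    return 2 if m1 == m2 else 1


def rank_candidates(query, candidates):
    """Return the candidates that match the query, best rank first."""
    scored = [(match_rank(query, c), c) for c in candidates]
    ordered = sorted(scored, key=lambda pair: -pair[0])  # stable for equal ranks
    return [c for rank, c in ordered if rank > 0]
```

For a query such as "Juan Carlos Garcia", a candidate with the same second name ranks above a plain "Juan Garcia", and "Jose Garcia" is dropped.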

Steve Mc 2009-06-23 17:31:55
