I'm doing a website migration that involves extracting firstname and lastname from fullname. Given these were created by the end user, all kinds of permutations exist (although English and generally not too strange). Mostly I can take the first word as firstname and the last word as the lastname but have some exceptions from the occasional prefix and suffix. In going through the data and trying to get my head around all the likely exceptions I realized that this is a common problem that has been at least partially solved many times before.

Before reinventing the wheel, does anyone have any regular expressions that have worked for them or useful code? Performance is not a consideration as this is a one-time utility.

Typical values to be handled:

Jason Briggs, J.D. Smith, John Y Citizen, J Scott Myers, Bill Jackobson III, Mr. John Mills


Update: while a common problem, the typical solution seems to involve handling the majority of cases and manually cleaning the rest.

(Given how often this issue must come up, I originally expected to find a utility library for it, but I was not able to find one with Google.)

+2  A: 

If this is a one-shot deal then I would strongly consider paying a specialist to do it for you.

They will be experienced in working with poorly structured data sets.

I have no affiliation with them but Melissa Data provide a service that seems tailored to this sort of thing.

ShuggyCoUk
+4  A: 

It's probably impossible to do (reliably).

Even if you can do that for some names, you will get a Spanish person at some point, who will write down both family names. Or some people (forgot which nationality it is) that will put in "lastname firstname". Or one of many other situations...

The best you can probably do is split 2 words as first and last name, then go through the rest manually (yourself, or hire some professionals)...

viraptor
I think you are correct. The language permutations and unstructured nature make this an indeterminate problem, as Apoorv pointed out.
Stuart
And don't forget last names like O'Neill and Van Der Spek and Van Eck and Hart-Mahon and de la Cruz...
jeffamaphone
+2  A: 

This is an indeterminate problem (or an Oracle problem, as I like to call it) and cannot be solved reliably. That is because some names are both first names and last names, e.g. Stanley, Jackson, etc. But you can still make an attempt. You need to write a learning program that is given a set of known first names and last names and maintains a dictionary of these names, each mapped to the probability that it is a first name.

Now, pass in all the values to be migrated; using these probabilities you can get a reasonable split between first and last names. Furthermore, if a particular name is ambiguous (it is entirely up to you to define ambiguous, but I would define it as the bottom 30th percentile of all the probability values obtained), you can flag it for later review.
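A minimal sketch of the dictionary idea, in Python. Everything here is an assumption for illustration: the function names `train` and `split_two_words` are made up, unseen names are treated as 50/50, and "ambiguous" is approximated with a fixed `margin` between the two probabilities rather than the percentile cut-off described above.

```python
from collections import Counter


def train(samples):
    """Build {name: P(name is a first name)} from known (first, last) pairs."""
    first_counts, total_counts = Counter(), Counter()
    for first, last in samples:
        first_counts[first.lower()] += 1
        total_counts[first.lower()] += 1
        total_counts[last.lower()] += 1
    return {name: first_counts[name] / total_counts[name] for name in total_counts}


def split_two_words(fullname, p_first, margin=0.3):
    """Return (first, last, needs_review) for a two-word name.

    The word more likely to be a first name becomes the first name.
    If the two probabilities are closer than `margin` (an assumed
    threshold), the result is flagged for manual review.
    """
    a, b = fullname.split()
    pa = p_first.get(a.lower(), 0.5)  # unseen names: assume 50/50
    pb = p_first.get(b.lower(), 0.5)
    if pa >= pb:
        first, last = a, b
    else:
        first, last = b, a
    return first, last, abs(pa - pb) < margin
```

You would train it on whatever clean first/last pairs you already have, then run the rest of the data through `split_two_words` and collect the rows where the third value is true.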

Hope this helps.

Cheers!

Ak
Furthermore, for cases like J.D. Smith, you can mostly treat J.D. as the first name and Smith as the last name.
Ak
+9  A: 

My recommendation would be the following:

1) Split the names on the spaces.

2) Check the length of the returned array. If 2, easy split. If more, next.

3) Compare the 1st value against a list of prefixes (e.g. Mr., Mrs., Ms., Dr.). If it matches, remove it; otherwise move to the next step.

4) Compare the 1st value for length. If it's just 1 character, combine first 2 items in the array.

It's still not foolproof; however, it should address at least 80 per cent of your cases.
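The four steps could be sketched roughly like this in Python. The prefix list and the name `split_name` are assumptions, as is the choice to return `None` for anything the steps do not resolve so it can be cleaned up by hand.

```python
PREFIXES = {"mr", "mrs", "ms", "miss", "dr"}  # assumed list, extend as needed


def split_name(fullname):
    """Return (firstname, lastname), or None if the name needs manual review."""
    parts = fullname.split()                      # step 1: split on spaces
    if len(parts) < 2:
        return None                               # single word: cannot split
    if len(parts) == 2:                           # step 2: the easy case
        return parts[0], parts[1]
    if parts[0].rstrip(".").lower() in PREFIXES:  # step 3: drop a leading prefix
        parts = parts[1:]
    # step 4: a one-character first value is an initial, so join it
    # with the next word ("J Scott Myers" -> "J Scott", "Myers")
    if len(parts) > 2 and len(parts[0].rstrip(".")) == 1:
        parts = [parts[0] + " " + parts[1]] + parts[2:]
    if len(parts) == 2:
        return parts[0], parts[1]
    return None                                   # still ambiguous
```

On the sample values from the question this resolves Jason Briggs, J.D. Smith, J Scott Myers and Mr. John Mills, and leaves John Y Citizen and Bill Jackobson III for review.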

Hope this helps.

JamesEggers
I would agree with this. If you can break the data down into various reliably parsed data sets, you may find that the remaining "trouble cases" are few enough for a human to handle.
smercer
James - thanks for your very practical ideas. Given the data is generally fairly good, I think this should address ~95% of cases.
Stuart
+1  A: 

If you only have a few users (<100k) then see if you can get somebody to do it manually, and use your time on something worthwhile. Since it is a one-time job the ROI sucks :-)

Kasper
There are around 10K users, so you are probably right - it's the irrational programmer compulsion to spend 5 hours trying to resolve half of the edge cases that would take an hour of 'manual' cleansing by an intern.
Stuart
Exactly ;-) I know it only too well.
Kasper
+3  A: 

The fastest thing to do is a hybrid algorithm-human approach. You don't want to spend the time putting together a system that works 99.99% of the time because the last 5-10% of optimization will kill you. Also, you don't want to just dump all of the work on a person because most of the cases (I'm guessing) are fairly straightforward.

So, rapidly build something like what JamesEggers suggested, but catch all of the cases that appear unusual or do not fit your predefined conversions. Then, simply go through those cases manually (there shouldn't be too many).
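As a rough illustration of that triage, here is a Python sketch. `triage` and `simple_split` are hypothetical names; `simple_split` is only a placeholder rule (exactly two words) standing in for whatever splitter you actually build, and anything it returns `None` for goes to the manual pile.

```python
def simple_split(fullname):
    """Placeholder rule: only split names that are exactly two words."""
    parts = fullname.split()
    return (parts[0], parts[1]) if len(parts) == 2 else None


def triage(fullnames, splitter=simple_split):
    """Separate names the splitter can handle from those needing a human."""
    resolved, review = {}, []
    for name in fullnames:
        result = splitter(name)
        if result is None:
            review.append(name)      # unusual case: leave for manual cleanup
        else:
            resolved[name] = result  # (firstname, lastname)
    return resolved, review
```

The length of the `review` list tells you quickly whether it is worth refining the rules further or just handing the remainder to a person.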

You could go through those cases by yourself or outsource them to other users by setting up HITs in Mechanical Turk:

http://aws.amazon.com/mturk/

(Assuming 500 cases at $0.05 (high reward) your total cost should be $25 at most)

Robert Venables
I think you have the balance right here.
Stuart
+1  A: 

I dug up a very simple (80% probably) regex I had in Perl and added some happy C# group names (match it case-insensitively):

^(?<title>(mr|ms|mrs|miss|dr|hon)\.?\s+)?(?<firstandmiddle>.+?)\s+(?<last>((van|de|von)\s+)?\S+?)(?<junior>\s+(jr|sr|ii|iii|iv)\.?)?$
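For anyone trying this outside C#, here is a hedged Python adaptation rather than a straight copy. Python spells named groups `(?P<name>...)`, and this sketch assumes a few changes: the pattern is anchored, the suffix group is optional, the name parts use lazy quantifiers so a trailing suffix is not swallowed by the last name, and matching is case-insensitive. The `parse` helper is a made-up name.

```python
import re

NAME_RE = re.compile(
    r"""^
    (?P<title>(mr|ms|mrs|miss|dr|hon)\.?\s+)?   # optional title
    (?P<firstandmiddle>.+?)\s+                  # first and middle names (lazy)
    (?P<last>((van|de|von)\s+)?\S+?)            # last name, optional particle
    (?P<junior>\s+(jr|sr|ii|iii|iv)\.?)?        # optional suffix
    $""",
    re.IGNORECASE | re.VERBOSE,
)


def parse(fullname):
    """Return a dict of the four named parts, or None if nothing matches."""
    m = NAME_RE.match(fullname.strip())
    if m is None:
        return None
    # Unmatched optional groups come back as None; normalise to "".
    return {key: (value or "").strip() for key, value in m.groupdict().items()}
```

A single-word name returns `None`, which is a convenient signal to send that row for manual review.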

I'm posting as wiki, so anybody feel free to add stuff that they think would help!

Mike
Thanks Mike, I'll see how it goes on my data.
Stuart
+1  A: 

As others pointed out there is no solution that works in all cases. One reason for this is that there are names that can be used as a first as well as a last name.

You could use a database of first names to find out which parts of the name are possible first names. If you also know the country of the person with a particular name, you can increase accuracy a lot.

For a free database of first names see this answer.

Ludwig Weinzierl
+1  A: 

If your data universe is <10k names and it's a one-time deal, implement one of the split scenarios described by other posters, write the results to an intermediate file, then go through it manually and update where necessary (you'd be surprised how little time it takes to vet 10k names). It will take you less time than trying to find or build the perfectly implemented algorithm. Once your universe of names exceeds 100k, it's worth trying to program your way out of it and spinning off a file for manual review and correction of all names that don't give you a perfect firstname, lastname split.
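A small sketch of the intermediate-file step in Python, using CSV so it opens in a spreadsheet. The column names and `write_review_sheet` are assumptions, and the split is just the first-word/last-word guess from the question, with anything other than two words marked for a closer look.

```python
import csv
import io


def write_review_sheet(fullnames, out):
    """Write one row per name with a best-guess split for manual vetting."""
    writer = csv.writer(out)
    writer.writerow(["fullname", "firstname", "lastname", "needs_review"])
    for name in fullnames:
        parts = name.split()
        first = parts[0] if parts else ""
        last = parts[-1] if len(parts) > 1 else ""
        # Anything other than exactly two words deserves a second look.
        needs_review = "yes" if len(parts) != 2 else ""
        writer.writerow([name, first, last, needs_review])


# Example: write to an in-memory buffer; pass an open file for real use.
buf = io.StringIO()
write_review_sheet(["Jason Briggs", "Bill Jackobson III"], buf)
```

For a real run you would pass a file opened with `open(path, "w", newline="")` instead of the `StringIO` buffer, then sort the sheet on the `needs_review` column.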

kloucks