A lot of contact management programs do this - you type in a name (e.g., "John W. Smith") and it automatically breaks it up internally into:

First name: John
Middle name: W.
Last name: Smith

Likewise, it figures out things like "Mrs. Jane W. Smith" and "Dr. John Doe, Jr." correctly as well (assuming you allow for fields like "prefix" and "suffix" in names).

I assume this is a fairly common thing that people would want to do, so the question is: how would you do it? Is there a simple algorithm for this? Maybe a regular expression?

I'm after a .NET solution, but I'm not picky.

Update: I appreciate that there is no simple solution for this that covers ALL edge cases and cultures... but let's say for the sake of argument that you need the name in pieces (filling out forms - as in, say, tax or other government forms - is one case where you are bound to enter the name into fixed fields, whether you like it or not), but you don't necessarily want to force the user to enter their name into discrete fields (less typing = easier for novice users).

You'd want to have the program "guess" (as best it can) on what's first, middle, last, etc. If you can, look at how Microsoft Outlook does this for contacts - it lets you type in the name, but if you need to clarify, there's an extra little window you can open. I'd do the same thing - give the user the window in case they want to enter the name in discrete pieces - but allow for entering the name in one box and doing a "best guess" that covers most common names.

+22  A: 

If you must do this parsing, I'm sure you'll get lots of good suggestions here.

My suggestion is - don't do this parsing.

Instead, create your input fields so that the information is already separated out. Have separate fields for title, first name, middle initial, last name, suffix, etc.

shadit
"...create your input fields so that the information is already separated out."That's true. I guess this is the Best Practice on this kind of stuff. This is also true with the "Address"
MarlonRibunal
What about people with only one name? People whose first name is the name of formal address (Japan, China, Korea)? If nothing else, you'd better not address "Mausam" as "Dear Mausam [NULL]..."
crosstalk
Agreed. I took over a legacy app and I'm forced to parse the names because it's all stored together. It's a pain and I'm always having to add some extra logic to the routine to account for an outlier. User input varies, of course, so it's largely a crapshoot.
Mike L
Good points. I guess depending upon the expected user community, it might be necessary to provide a single input field that still does not get parsed - just used as-is.
shadit
Agreed. In addition to the other comments, you run the risk of **seriously** pissing people off because you are ignorant of the cultural issues around how their name is composed.
Simon Munro
Absolutely. When it comes to forms and such, the user is in the single best position to distinguish what the names should be.
Greg D
+5  A: 

There is no simple solution for this. Name construction varies from culture to culture, and even in the English-speaking world there are prefixes and suffixes that aren't necessarily part of the name.

A basic approach is to look for honorifics at the beginning of the string (e.g., "Hon. John Doe") and numbers or some other strings at the end (e.g., "John Doe IV", "John Doe Jr."), but really all you can do is apply a set of heuristics and hope for the best.

It might be useful to find a list of unprocessed names and test your algorithm against it. I don't know that there's anything prepackaged out there, though.

Stephen Deken
Shad's answer is more correct when it comes to making actual systems.
Bob Cross
A: 

You can do the obvious things: look for Jr., II, III, etc. as suffixes and Mr., Mrs., Dr., etc. as prefixes and remove them; then the first word is the first name, the last word is the last name, and everything in between is a middle name. Other than that, there's no foolproof solution for this.
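A minimal Python sketch of that approach, for illustration only. The `PREFIXES` and `SUFFIXES` sets and the `split_name` function are hypothetical names, and the lists are deliberately short; a real application would need much longer ones.

```python
# Hypothetical, deliberately short lists; not exhaustive.
PREFIXES = {"mr", "mrs", "ms", "miss", "dr", "hon"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def split_name(full_name):
    """Best-guess split into (prefix, first, middle, last, suffix)."""
    # Drop commas so "John Doe, Jr." tokenizes the same as "John Doe Jr."
    tokens = full_name.replace(",", " ").split()
    prefix = suffix = ""
    # Compare without the trailing period so "Dr." and "Dr" both match.
    if len(tokens) > 1 and tokens[0].rstrip(".").lower() in PREFIXES:
        prefix = tokens.pop(0)
    if len(tokens) > 1 and tokens[-1].rstrip(".").lower() in SUFFIXES:
        suffix = tokens.pop()
    first = tokens[0] if tokens else ""
    last = tokens[-1] if len(tokens) > 1 else ""
    middle = " ".join(tokens[1:-1])  # everything between first and last
    return prefix, first, middle, last, suffix


print(split_name("Dr. John Doe, Jr."))   # ('Dr.', 'John', '', 'Doe', 'Jr.')
print(split_name("Mrs. Jane W. Smith"))  # ('Mrs.', 'Jane', 'W.', 'Smith', '')
```

A name with a multi-word surname or a multi-word first name still comes out wrong under this scheme, because the tokens alone do not say where the boundaries are.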

A perfect example of the problem is David Lee Roth (last name: Roth) and Eddie Van Halen (last name: Van Halen). If Ann Marie Smith's first name is "Ann Marie", there's no way to distinguish that from Ann having a middle name of Marie.

Graeme Perrow
Any explanation for the downvote after almost a year?
Graeme Perrow
A: 

I would say: strip out salutations using a list, then split by space, placing list.first() as the first name and list.last() as the last name, then join the remainder with spaces and use that as the middle name. And ABOVE ALL, display your results and let the user modify them!

George Mauer
A: 
osp70
A: 

Sure, there is a simple solution: split the string on spaces and count the tokens. If there are 2, interpret them as FIRST and LAST name; if there are 3, interpret them as FIRST, MIDDLE, and LAST.

The problem is that the simple solution will not be 100% correct: someone could always enter a name with many more tokens, or could include titles, a last name with a space in it (is this possible?), etc. You can come up with a solution that works for most names most of the time, but not an absolute solution.

I would follow Shad's recommendation to split the input fields.

matt b
As you point out, that approach would fail for Mr. T. I pity the fool who parses his name wrong!
shadit
+2  A: 

You probably don't need to do anything fancy really. Something like this should work.

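    // Requires: using System;
    // Split on spaces, ignoring runs of repeated spaces.
    string[] arrNames = Name.Trim().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    string GivenName = "", MiddleName = "", FamilyName = "";

    // Test the number of tokens (arrNames.Length), not the number of characters (Name.Length).
    if (arrNames.Length > 0) {
        GivenName = arrNames[0];
    }
    if (arrNames.Length > 1) {
        FamilyName = arrNames[arrNames.Length - 1];
    }
    if (arrNames.Length > 2) {
        // Everything between the first and last tokens becomes the middle name.
        MiddleName = string.Join(" ", arrNames, 1, arrNames.Length - 2);
    }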

You may also want to check for titles first.

Vincent McNabb
If the design dictates that this parsing must take place, I agree this would be a good type of approach.
shadit
A: 

You don't want to do this, unless you are only going to be contacting people from one culture.

For example:

Guido van Rossum's last name is van Rossum.

MIYAZAKI Hayao's first name is Hayao.

The most you can do is strip off common titles and salutations and try some heuristics.
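One such heuristic, sketched in Python purely as an illustration: treat a short list of surname particles ("van", "von", "de", ...) as part of the last name. The `PARTICLES` set and `split_with_particles` function are hypothetical, the list is incomplete, and the sketch still assumes given-name-first order, so it gets "MIYAZAKI Hayao" wrong.

```python
# Hypothetical, incomplete list of surname particles.
PARTICLES = {"van", "von", "de", "der", "den", "la", "le", "di", "da"}


def split_with_particles(full_name):
    """Return (first, middle, last), folding particles such as "van" into the last name."""
    tokens = full_name.split()
    if len(tokens) < 2:
        return (tokens[0] if tokens else ""), "", ""
    # Walk left from the final token while the preceding token is a particle,
    # but never consume the first token, which stays the given name.
    start = len(tokens) - 1
    while start > 1 and tokens[start - 1].lower() in PARTICLES:
        start -= 1
    return tokens[0], " ".join(tokens[1:start]), " ".join(tokens[start:])


print(split_with_particles("Guido van Rossum"))  # ('Guido', '', 'van Rossum')
```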

Even so, the easiest solution is to just store the full name, or ask for given and family name separately.

1729
A: 

This is a fool's errand. There are too many exceptions to do this deterministically. If you were doing this to pre-process a list for further review, I would contend that less would certainly be more.

  1. Strip out salutations, titles and generational suffixes (big regex, or several small ones)
  2. If only one name, it is 'last'.
  3. If only two names, split them first, last.
  4. If three tokens and the middle is an initial, split them first, middle, last.
  5. Sort the rest by hand.

Any further processing is almost guaranteed to create more work, as you have to go back and recombine what your processing split up.
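The five steps above could be sketched in Python roughly as follows. The regular expressions and the `triage` function are hypothetical and deliberately abbreviated; returning `None` stands in for "sort the rest by hand".

```python
import re

# Hypothetical, abbreviated patterns for illustration only.
TITLE_RE = re.compile(r"^(?:mr|mrs|ms|miss|dr|prof)\.?\s+", re.IGNORECASE)
SUFFIX_RE = re.compile(r",?\s+(?:jr|sr|ii|iii|iv)\.?$", re.IGNORECASE)
INITIAL_RE = re.compile(r"^[A-Za-z]\.?$")


def triage(full_name):
    """Return (first, middle, last), or None to flag the name for manual review."""
    name = full_name.strip()
    name = TITLE_RE.sub("", name)   # step 1: strip a leading title
    name = SUFFIX_RE.sub("", name)  # step 1: strip a generational suffix
    tokens = name.split()
    if len(tokens) == 1:
        return ("", "", tokens[0])                # step 2: a lone name is 'last'
    if len(tokens) == 2:
        return (tokens[0], "", tokens[1])         # step 3: first, last
    if len(tokens) == 3 and INITIAL_RE.match(tokens[1]):
        return (tokens[0], tokens[1], tokens[2])  # step 4: first, initial, last
    return None                                   # step 5: leave for a human


print(triage("John W. Smith"))   # ('John', 'W.', 'Smith')
print(triage("David Lee Roth"))  # None
```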

Frosty
+2  A: 

I appreciate that this is hard to do right - but if you provide the user a way to edit the results (say, a pop-up window to edit the name if it didn't guess right) and still guess "right" for most cases... of course it's the guessing that's tough.

It's easy to say "don't do it" when looking at the problem theoretically, but sometimes circumstances dictate otherwise. Having fields for all the parts of a name (title, first, middle, last, suffix, just to name a few) can take up a lot of screen real estate - and combined with the problem of the address (a topic for another day) can really clutter up what should be a clean, simple UI.

I guess the answer should be "don't do it unless you absolutely have to, and if you do, keep it simple (some methods for this have been posted here) and provide the user the means to edit the results if needed."

Keithius
If you are going to be displaying the results of your guess on the screen, why not forgo the guessing and let the user enter their own text into the fields? Displaying the guesses introduces as much clutter as skipping the guesswork.
Frosty
Not if it's a separate window :-)
Keithius
I agree with you. "Don't do it" isn't a very helpful answer. This is definitely not easy and is never going to be perfect, but I think with a little effort you can provide a better user experience doing something like this. Also, as long as you give the user a chance to review/edit the fields on a different screen if needed, then this is fine.
delux247
A: 

I had to do this. Actually, something much harder than this, because sometimes the "name" would be "Smith, John" or "Smith John" instead of "John Smith", or not a person's name at all but instead a name of a company. And it had to do it automatically with no opportunity for the user to correct it.

What I ended up doing was coming up with a finite list of patterns that the name could be in, like:
  • Last, First Middle-Initial
  • First Last
  • First Middle-Initial Last
  • Last, First Middle
  • First Middle Last

Throw your Mr.'s and Jr.'s in there too. Let's say you end up with a dozen or so patterns.

My application had a dictionary of common first names, common last names (you can find these on the web), common titles, and common suffixes (jr, sr, md), and using that it was able to make really good guesses about the patterns. I'm not that smart and my logic wasn't that fancy, and yet it wasn't that hard to create some logic that guessed right more than 99% of the time.
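A much-reduced Python sketch of the idea, not the application described above: a comma or a pair of name lists decides which pattern applies. `COMMON_FIRST`, `COMMON_LAST`, and `guess_order` are hypothetical, and the sample lists are tiny stand-ins for the large lists mentioned.

```python
# Hypothetical, tiny sample lists standing in for large dictionaries of names.
COMMON_FIRST = {"john", "jane", "david", "ann", "eddie"}
COMMON_LAST = {"smith", "doe", "roth", "olson"}


def guess_order(name):
    """Return (first, middle, last), using a comma or the name lists to pick the pattern."""
    if "," in name:
        # "Last, First Middle" patterns: the comma settles the order.
        last, _, rest = name.partition(",")
        tokens = rest.split()
        first = tokens[0] if tokens else ""
        return first, " ".join(tokens[1:]), last.strip()
    tokens = name.split()
    if len(tokens) < 2:
        return (tokens[0] if tokens else ""), "", ""
    # "Smith John": first token looks like a surname, last token like a given name.
    if tokens[0].lower() in COMMON_LAST and tokens[-1].lower() in COMMON_FIRST:
        return tokens[-1], " ".join(tokens[1:-1]), tokens[0]
    # Default pattern: "First Middle Last".
    return tokens[0], " ".join(tokens[1:-1]), tokens[-1]


print(guess_order("Smith, John W."))  # ('John', 'W.', 'Smith')
print(guess_order("Smith John"))      # ('John', '', 'Smith')
```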

Corey Trager
+1  A: 

If you simply have to do this, add the guesses to the UI as an optional selection. This way, you could tell the user how you parsed the name and let them pick a different parsing from a list you provide.
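As a rough Python illustration of offering alternatives, the hypothetical `candidate_parsings` function below lists the conventional split first and then every split that has no middle name, so a UI could show them as choices.

```python
def candidate_parsings(full_name):
    """List plausible (first, middle, last) splits, most conventional first."""
    tokens = full_name.split()
    if len(tokens) < 2:
        return [(full_name.strip(), "", "")]
    # Conventional guess: first token, last token, everything between is the middle name.
    candidates = [(tokens[0], " ".join(tokens[1:-1]), tokens[-1])]
    # Alternatives with no middle name: move the first/last boundary one token at a time.
    for i in range(1, len(tokens)):
        option = (" ".join(tokens[:i]), "", " ".join(tokens[i:]))
        if option not in candidates:
            candidates.append(option)
    return candidates


for option in candidate_parsings("John Wayne Olson"):
    print(option)
```

For a three-token name this yields three options, which is few enough to show in a short pick list.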

Omer van Kloeten
A: 

SEE MORE DISCUSSION (almost exactly 1 year ago):
http://discuss.joelonsoftware.com/default.asp?design.4.551889.41

micahwittman
A: 

I agree, there's no simple solution for this. But I found an awful approach in a Microsoft KB article for VB 5.0 that actually implements much of what is discussed here: http://support.microsoft.com/kb/168799

Something like this could be used in a pinch.

Shawn Miller
+1  A: 

Understanding this is a bad idea, I wrote this regex in Perl - here's what worked the best for me. I had already filtered out company names.
Output in vcard format: (hon_prefix, given_name, additional_name, family_name, hon_suffix)

    /^ \s*
        (?: ( Dr\. | Mr\. | Mr?s\. | Miss | 2nd\sLt\. | Sen\.? ) \s+ )?   # prefix (literal dots escaped)
        ( \w+ | \w\. )                                                    # first name
        (?: \s+ ( \w\.? | \w\w+ ) )?                                      # middle name or initial
        (?: \s+ ( (?:[OD]['’]\s?)? [-\w]+ ) )                             # last name
        (?: ,? \s+ ( [JS]r\.? | Esq\.? | (?:M|Ph|Ed)\.?\s*D\.?            # suffix: Jr/Sr, Esq, MD/PhD/EdD,
                   | R\.?N\.? | I+ ) )?                                   #   RN, or I, II, III
    \s* $/x

notes:

  • doesn't handle IV, V, VI
  • Hard-coded lists of prefixes, suffixes. evolved from dataset of ~2K names
  • Doesn't handle multiple suffixes (eg. MD, PhD)
  • Designed for American names - will not work properly on romanized Japanese names or other naming systems
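The first and third notes could be addressed by peeling suffixes off the end of the string before matching the rest. The following is a hedged Python sketch, not a port of the regex above: `SUFFIX`, `TRAILING_SUFFIXES`, and `split_suffixes` are hypothetical names, and the suffix list is the same hard-coded set plus Roman numerals II through VIII.

```python
import re

# Hypothetical suffix pattern; extend the alternation for your own data.
SUFFIX = r"(?:[JS]r\.?|Esq\.?|(?:M|Ph|Ed)\.?\s*D\.?|R\.?N\.?|I{2,3}|IV|VI{0,3})"
# One or more suffixes at the end, each preceded by a comma and/or whitespace.
TRAILING_SUFFIXES = re.compile(r"((?:\s*,\s*|\s+)" + SUFFIX + r")+$")


def split_suffixes(name):
    """Return (name_without_suffixes, [suffixes])."""
    match = TRAILING_SUFFIXES.search(name)
    if not match:
        return name.strip(), []
    # The matched tail contains only separators and suffixes, so pull each suffix out.
    suffixes = re.findall(SUFFIX, match.group(0))
    return name[:match.start()].strip(), suffixes


print(split_suffixes("John Doe IV"))          # ('John Doe', ['IV'])
print(split_suffixes("Jane Smith, MD, PhD"))  # ('Jane Smith', ['MD', 'PhD'])
```

The remaining string can then be matched against a prefix/first/middle/last pattern without a suffix group.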
Thelema
Could you please tell me how to add suffix support? American name support is all I am looking for. An algorithm for an actual implementation would be awesome. Thank you.
ThinkCode
A: 

There is no 100% reliable way to do this.

You can split on spaces, and try to understand the name all you want, but when it comes down to it, you will get it wrong sometimes. If that is good enough, go for any of the answers here that give you ways to split.

But some people will have a name like "John Wayne Olson", where "John Wayne" is the first name, and someone else will have a name like "John Wayne Olson" where "Wayne" is their middle name. There is nothing present in that name that will tell you which way to interpret it.

That's just the way it is. It's an analogue world.

My rules are pretty simple.

Take the last part --> Last Name
If there are multiple parts left, take the last part --> Middle name
What is left --> First name
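Those three rules, sketched in Python for illustration (the function name `split_from_right` is hypothetical):

```python
def split_from_right(full_name):
    """Apply the rules right to left and return (first, middle, last)."""
    tokens = full_name.split()
    # Take the last part as the last name.
    last = tokens.pop() if tokens else ""
    # If multiple parts are left, the last of them is the middle name.
    middle = tokens.pop() if len(tokens) > 1 else ""
    # Whatever is left is the first name, possibly more than one word.
    first = " ".join(tokens)
    return first, middle, last


print(split_from_right("John Wayne Olson"))  # ('John', 'Wayne', 'Olson')
```

Note that under these rules a four-part name such as "Mary Ann Lee Smith" gets a two-word first name, and a single-word name lands in the last-name field.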

But don't assume this will be 100% accurate, nor will any other hardcoded solution. You will need to let users edit the result themselves.

Lasse V. Karlsen
A: 

don't understand

What don't you understand? The question is: how do you take a name like "John W. Smith" and turn it into "First name: John", "Middle Name: W.", and "Last Name: Smith". It's a common problem, and not one that is easy to solve - as we've seen here!
Keithius
+1  A: 

There are a few add-ins we have used in our company to accomplish this. I ended up creating a way to actually specify the name formats on our different imports for different clients. There is a company that has a tool that in my experience is well worth the price and is really incredible when tackling this subject. It's at: http://www.softwarecompany.com/ and works great. The most efficient way to do this without using any statistical approach is to split the string by commas or spaces, then: 1. strip out titles and prefixes; 2. strip out suffixes; 3. parse the name according to the order of the string (2 names = F & L; 3 names = F M L or L M F).

Sean Fair