I'm having to do some data conversion, and I need to try to match up on names that are not a direct match on full name. I'd like to be able to take the full name field and break it up into first, middle and last name.

The data does not include any prefixes or suffixes. The middle name is optional. The data is formatted 'First Middle Last'.

I'm interested in some practical solutions to get me 90% of the way there. As it has been stated, this is a complex problem, so I'll handle special cases specially :)

+5  A: 

Unless you have very, very well-behaved data, this is a non-trivial challenge. A naive approach would be to tokenize on whitespace and assume that a three-token result is [first, middle, last] and a two-token result is [first, last], but you're going to have to deal with multi-word surnames (e.g. "Van Buren") and multiple middle names.
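
A minimal sketch of that naive approach, written in Python purely for illustration (the function name `split_naive` is hypothetical and not from this thread). It returns `None` for anything other than two or three tokens, so those rows can be set aside:

```python
from typing import Optional, Tuple

def split_naive(full_name: str) -> Optional[Tuple[str, Optional[str], str]]:
    """Split 'First [Middle] Last' on whitespace.

    Returns (first, middle, last), with middle=None for two-token names,
    or None when the token count is not 2 or 3.
    """
    tokens = full_name.split()  # split() with no argument collapses runs of whitespace
    if len(tokens) == 3:
        return tokens[0], tokens[1], tokens[2]
    if len(tokens) == 2:
        return tokens[0], None, tokens[1]
    return None

print(split_naive("Susan B Anthony"))
print(split_naive("Alexander Hamilton"))
print(split_naive("Martin Van Buren"))
```

Note that "Martin Van Buren" has three tokens, so this sketch reports "Van" as a middle name without complaint, which is exactly the multi-word surname problem mentioned above.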

Josh Millard
I ended up being naive, and it all worked out in the end. Thanks.
Even Mien
+2  A: 

Are you sure the Full Legal Name will always include First, Middle and Last? I know people who have only one name as their Full Legal Name, and honestly I am not sure if that's their First or Last Name. :-) I also know people who have more than one First name in their legal name, but don't have a Middle name. And there are some people who have multiple Middle names.

Then there's also the order of the names in the Full Legal Name. As far as I know, in some Asian cultures the Last Name comes first in the Full Legal Name.

On a more practical note, you could split the Full Name on whitespace and treat the first token as the First name and the last token (or the only token, if there is just one name) as the Last name. This assumes, though, that the order will always be the same.

Franci Penov
There are also people who have only a first name. Not only celebrities like Madonna and Cher and Bono, but it's traditional in Iceland for example to go by your first name only.
Bill Karwin
This seems like the practical approach that I need to use. The middle name could be anything that is not included in the First or Last name.
Even Mien
@Bill Karwin - yep, I mentioned that if you have a person with only one name, it's not clear if it's their first or last name.
Franci Penov
A: 
  1. Get a sql regex function. Sample: http://msdn.microsoft.com/en-us/magazine/cc163473.aspx
  2. Extract names using regular expressions.
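
As a rough illustration of step 2, here is what such a regular expression might look like, sketched in Python rather than as a SQL regex function. The pattern and the name `parse_name` are assumptions for this example; it only accepts the 'First Middle Last' format from the question, with exactly one word per part:

```python
import re

# Assumed format from the question: 'First Middle Last', middle optional.
# \S+ is used instead of \w+ so hyphens and apostrophes stay inside a token.
NAME_RE = re.compile(
    r"^\s*(?P<first>\S+)(?:\s+(?P<middle>\S+))?\s+(?P<last>\S+)\s*$"
)

def parse_name(full_name):
    """Return a dict with first/middle/last, or None if the name does not fit."""
    match = NAME_RE.match(full_name)
    return match.groupdict() if match else None

print(parse_name("Susan B Anthony"))
print(parse_name("Alexander Hamilton"))
print(parse_name("Jan van der Ploeg"))  # four tokens: does not fit the pattern
```

Names with one token or more than three tokens return `None` here, so they can be reviewed by hand instead of being split wrongly.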

I recommend Expresso for learning/building/testing regular expressions. Old free version, new commercial version

Bartek Szabat
A: 

It's difficult to answer without knowing how the "full name" is formatted.

It could be "Last Name, First Name Middle Name" or "First Name Middle Name Last Name", etc.

Basically you'll have to use the SUBSTRING function

SUBSTRING ( expression , start , length )

And probably the CHARINDEX function

CHARINDEX (substr, expression)

To figure out the start and length for each part you want to extract.

So if the format is "First Name Last Name", you could do something like this (untested, but should be close):

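SELECT 
SUBSTRING(fullname, 1, CHARINDEX(' ', fullname) - 1) AS FirstName, 
-- SUBSTRING requires a length; LEN(fullname) is simply long enough to reach the end
SUBSTRING(fullname, CHARINDEX(' ', fullname) + 1, LEN(fullname)) AS LastName
FROM YourTable
WHERE CHARINDEX(' ', fullname) > 0 -- skip rows with no space, where the first SUBSTRING would fail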
neonski
I updated the data format after you posted this.
Even Mien
+3  A: 

Reverse the problem, add columns to hold the individual pieces and combine them to get the full name.

The reason this will be the best answer is that there is no guaranteed way to figure out what a person has registered as their first name, and what as their middle name.

For instance, how would you split this?

Jan Olav Olsen Heggelien

This, while fictitious, is a legal name in Norway, and could, but would not have to, be split like this:

First name: Jan Olav
Middle name: Olsen
Last name: Heggelien

or, like this:

First name: Jan Olav
Last name: Olsen Heggelien

or, like this:

First name: Jan
Middle name: Olav
Last name: Olsen Heggelien

I would imagine similar occurrences can be found in most languages.

So instead of trying to interpret data which does not have enough information to get it right, store the correct interpretation, and combine the pieces to get the full name.
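
A small sketch of what "store the pieces, derive the full name" could look like, in Python for illustration. The `PersonName` class and its field names are hypothetical, not a schema from this thread:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PersonName:
    first: str
    last: str
    middle: Optional[str] = None

    @property
    def full_name(self) -> str:
        # Skip a missing middle name so no double space appears.
        parts = [self.first, self.middle, self.last]
        return " ".join(p for p in parts if p)

# Two different interpretations of the same legal name:
a = PersonName(first="Jan Olav", middle="Olsen", last="Heggelien")
b = PersonName(first="Jan", middle="Olav", last="Olsen Heggelien")
print(a.full_name)
print(b.full_name)
```

Both records produce the same full name, which shows why going from pieces to full name is easy while the reverse direction is ambiguous.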

Lasse V. Karlsen
Unfortunately, this is data conversion. It is what it is.
Even Mien
Then you're going to have to build a simple algorithm, and just handle the errors afterwards when you become aware of them.
Lasse V. Karlsen
A: 

I'm not sure about SQL server, but in postgres you could do something like this:

SELECT 
  -- single backslashes assume standard_conforming_strings = on (the default since PostgreSQL 9.1)
  SUBSTRING(fullname, '(\w+)') as firstname,
  SUBSTRING(fullname, '\w+\s(\w+)\s\w+') as middle,
  COALESCE(SUBSTRING(fullname, '\w+\s\w+\s(\w+)'), SUBSTRING(fullname, '\w+\s(\w+)')) as lastname
FROM 
public.person

The regular expressions could probably be a bit more concise, but you get the point. Note that this does not work for people with multi-word names (we have a lot of these in the Netherlands, e.g. 'Jan van der Ploeg'), so I'd be very careful with the results.

p3t0r
+1  A: 

Like #1 said, it's not trivial. Hyphenated last names, initials, double names, inverse name sequence and a variety of other anomalies can ruin your carefully crafted function.

You could use a 3rd party library (plug/disclaimer - I worked on this product):

http://www.melissadata.com/nameobject/nameobject.htm

Marc Bernier
Hey we use Melissa data for zip codes. I didn't know you had something for names, will need to check it out.
HLGEM
+1  A: 

I would do this as an iterative process.

1) Dump the table to a flat file to work with.

2) Write a simple program to break up your names using a space as the separator, where the first token is the first name. If there are 3 tokens, then token 2 is the middle name and token 3 is the last name. If there are 2 tokens, then the second token is the last name. (Perl, Java, or C/C++; the language doesn't matter.)

3) Eyeball the results. Look for names that don't fit this rule.

4) Using that example, create a new rule to handle that exception...

5) Rinse and Repeat

Eventually you will get a program that fixes all your data.
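
The steps above can be sketched roughly as follows, in Python for illustration. Everything here is an assumption made for the example: the names are taken to be already loaded into a list (instead of a flat file), and `rule_van_surname` stands in for the kind of exception rule you might add after one review pass:

```python
def rule_two_or_three_tokens(tokens):
    """Step 2: the simple starting rule."""
    if len(tokens) == 3:
        return (tokens[0], tokens[1], tokens[2])
    if len(tokens) == 2:
        return (tokens[0], None, tokens[1])
    return None

def rule_van_surname(tokens):
    """Step 4: a hypothetical exception rule for surnames like 'Van Buren'."""
    if len(tokens) >= 3 and tokens[-2].lower() == "van":
        middle = " ".join(tokens[1:-2]) or None
        return (tokens[0], middle, " ".join(tokens[-2:]))
    return None

# Most specific rules first; new rules get added here on each iteration.
RULES = [rule_van_surname, rule_two_or_three_tokens]

def apply_rules(names, rules=RULES):
    """Return (parsed, review): parsed maps raw name -> (first, middle, last)."""
    parsed, review = {}, []
    for raw in names:
        tokens = raw.split()
        for rule in rules:
            result = rule(tokens)
            if result is not None:
                parsed[raw] = result
                break
        else:
            review.append(raw)  # step 3: eyeball these, then write a new rule
    return parsed, review

parsed, review = apply_rules(["George W Bush", "Martin Van Buren", "Tommy"])
print(parsed)
print(review)
```

Each pass shrinks the `review` list; whatever is left at the end is the set of special cases to fix by hand.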

Ben
+24  A: 

Here is a self-contained example, with easily manipulated test data.

With this example, if you have a name with more than three parts, then all the "extra" stuff will get put in the LAST_NAME field. An exception is made for specific strings that are identified as "titles", such as "DR", "MRS", and "MR".

If the middle name is missing, then you just get FIRST_NAME and LAST_NAME (MIDDLE_NAME will be NULL).

You could smash it into a giant nested blob of SUBSTRINGs, but readability is hard enough as it is when you do this in SQL.

Edit-- Handle the following special cases:

1 - The NAME field is NULL

2 - The NAME field contains leading / trailing spaces

3 - The NAME field has > 1 consecutive space within the name

4 - The NAME field contains ONLY the first name

5 - Include the original full name in the final output as a separate column, for readability

6 - Handle a specific list of prefixes as a separate "title" column

SELECT
  FIRST_NAME.ORIGINAL_INPUT_DATA
 ,FIRST_NAME.TITLE
 ,FIRST_NAME.FIRST_NAME
 ,CASE WHEN 0 = CHARINDEX(' ',FIRST_NAME.REST_OF_NAME)
       THEN NULL  --no more spaces?  assume rest is the last name
       ELSE SUBSTRING(
                       FIRST_NAME.REST_OF_NAME
                      ,1
                      ,CHARINDEX(' ',FIRST_NAME.REST_OF_NAME)-1
                     )
       END AS MIDDLE_NAME
 ,SUBSTRING(
             FIRST_NAME.REST_OF_NAME
            ,1 + CHARINDEX(' ',FIRST_NAME.REST_OF_NAME)
            ,LEN(FIRST_NAME.REST_OF_NAME)
           ) AS LAST_NAME
FROM
  (  
  SELECT
    TITLE.TITLE
   ,CASE WHEN 0 = CHARINDEX(' ',TITLE.REST_OF_NAME)
         THEN TITLE.REST_OF_NAME --No space? return the whole thing
         ELSE SUBSTRING(
                         TITLE.REST_OF_NAME
                        ,1
                        ,CHARINDEX(' ',TITLE.REST_OF_NAME)-1
                       )
    END AS FIRST_NAME
   ,CASE WHEN 0 = CHARINDEX(' ',TITLE.REST_OF_NAME)  
         THEN NULL  --no spaces @ all?  then 1st name is all we have
         ELSE SUBSTRING(
                         TITLE.REST_OF_NAME
                        ,CHARINDEX(' ',TITLE.REST_OF_NAME)+1
                        ,LEN(TITLE.REST_OF_NAME)
                       )
    END AS REST_OF_NAME
   ,TITLE.ORIGINAL_INPUT_DATA
  FROM
    (   
    SELECT
      --if the first three characters are in this list,
      --then pull it as a "title".  otherwise return NULL for title.
      CASE WHEN SUBSTRING(TEST_DATA.FULL_NAME,1,3) IN ('MR ','MS ','DR ','MRS')
           THEN LTRIM(RTRIM(SUBSTRING(TEST_DATA.FULL_NAME,1,3)))
           ELSE NULL
           END AS TITLE
      --if you change the list, don't forget to change it here, too.
      --so much for the DRY principle...
     ,CASE WHEN SUBSTRING(TEST_DATA.FULL_NAME,1,3) IN ('MR ','MS ','DR ','MRS')
           THEN LTRIM(RTRIM(SUBSTRING(TEST_DATA.FULL_NAME,4,LEN(TEST_DATA.FULL_NAME))))
           ELSE LTRIM(RTRIM(TEST_DATA.FULL_NAME))
           END AS REST_OF_NAME
     ,TEST_DATA.ORIGINAL_INPUT_DATA
    FROM
      (
      SELECT
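        --trim leading & trailing spaces before trying to process
        --collapse any run of spaces *within* the name to a single space
        --(two nested REPLACEs of double spaces still leave doubles behind for runs of 5+;
        -- this version assumes the name itself never contains '<' or '>')
        REPLACE(REPLACE(REPLACE(LTRIM(RTRIM(FULL_NAME)),' ','<>'),'><',''),'<>',' ') AS FULL_NAME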
       ,FULL_NAME AS ORIGINAL_INPUT_DATA
      FROM
        (
        --if you use this, then replace the following
        --block with your actual table
              SELECT 'GEORGE W BUSH' AS FULL_NAME
        UNION SELECT 'SUSAN B ANTHONY' AS FULL_NAME
        UNION SELECT 'ALEXANDER HAMILTON' AS FULL_NAME
        UNION SELECT 'OSAMA BIN LADEN JR' AS FULL_NAME
        UNION SELECT 'MARTIN J VAN BUREN SENIOR III' AS FULL_NAME
        UNION SELECT 'TOMMY' AS FULL_NAME
        UNION SELECT 'BILLY' AS FULL_NAME
        UNION SELECT NULL AS FULL_NAME
        UNION SELECT ' ' AS FULL_NAME
        UNION SELECT '    JOHN  JACOB     SMITH' AS FULL_NAME
        UNION SELECT ' DR  SANJAY       GUPTA' AS FULL_NAME
        UNION SELECT 'DR JOHN S HOPKINS' AS FULL_NAME
        UNION SELECT ' MRS  SUSAN ADAMS' AS FULL_NAME
        UNION SELECT ' MS AUGUSTA  ADA   KING ' AS FULL_NAME      
        ) RAW_DATA
      ) TEST_DATA
    ) TITLE
  ) FIRST_NAME
JosephStyons
Sigh... You provide a great answer in 41 minutes and get no votes while those who quickly explain how difficult it is get upvoted. +1
Kluge
Thanks! That's worth more than an upvote to me.
JosephStyons
You totally made my day! Thanks!
TrickyNixon
Great answer but it doesn't do a good job if the full name includes prefixes (Dr., Mr., Ms.)
EfficionDave
@EfficionDave: you are quite right, it won't handle those situations well at all. Those kinds of things are why the answer by Josh Millard is also true. Parsing unruly data is a nontrivial challenge, which is why Google is able to make so much money at it.
JosephStyons
@EfficionDave: ok, so I couldn't get it off my mind until I fixed that issue. Check out the revised version; you have to manually provide a list of strings you want to consider "titles" though.
JosephStyons
@JosephStyons: sounds like a great solution to the issue.
EfficionDave
I've created a SQL Function based on JosephStyons script above that returns the First Name given the full name. http://www.efficionconsulting.com/Blog/itemid/643/amid/1500/sql-function-to-parse-first-name-from-full-name.aspx
EfficionDave
A: 

I once made a 500 character regular expression to parse first, last and middle names from an arbitrary string. Even with that honking regex, it only got around 97% accuracy due to the complete inconsistency of the input. Still, better than nothing.

A: 

Subject to the caveats that have already been raised regarding spaces in names and other anomalies, the following code will at least handle 98% of names. (Note: messy SQL because I don't have a regex option in the database I use.)

**Warning: messy SQL follows:

create table parsname (fullname char(50), name1 char(30), name2 char(30), name3 char(30), name4 char(40));
insert into parsname (fullname) select fullname from ImportTable;
update parsname set name1 = substring(fullname, 1, locate(' ', fullname)),
 fullname = ltrim(substring(fullname, locate(' ', fullname), length(fullname)))
 where locate(' ', rtrim(fullname)) > 0;
update parsname set name2 = substring(fullname, 1, locate(' ', fullname)),
 fullname = ltrim(substring(fullname, locate(' ', fullname), length(fullname)))
 where locate(' ', rtrim(fullname)) > 0;
update parsname set name3 = substring(fullname, 1, locate(' ', fullname)),
 fullname = ltrim(substring(fullname, locate(' ', fullname), length(fullname)))
 where locate(' ', rtrim(fullname)) > 0;
update parsname set name4 = substring(fullname, 1, locate(' ', fullname)),
 fullname = ltrim(substring(fullname, locate(' ', fullname), length(fullname)))
 where locate(' ', rtrim(fullname)) > 0;
-- fullname now contains the last word in the string.
select fullname as FirstName, '' as MiddleName, '' as LastName from parsname where fullname is not null and name1 is null and name2 is null
union all
select name1 as FirstName, name2 as MiddleName, fullname as LastName from parsname where name1 is not null and name3 is null

The code works by creating a temporary table (parsname) and tokenizing the fullname by spaces. Any names ending up with values in name3 or name4 are non-conforming and will need to be dealt with differently.

Kluge
A: 

Here's a stored procedure that will put the first word found into First Name, the last word into Last Name and everything in between into Middle Name.

create procedure [dbo].[import_ParseName]
(            
    @FullName nvarchar(max),
    @FirstName nvarchar(255) output,
    @MiddleName nvarchar(255) output,
    @LastName nvarchar(255)  output
)
as
begin

set @FirstName = ''
set @MiddleName = ''
set @LastName = ''  
set @FullName = ltrim(rtrim(@FullName))

declare @ReverseFullName nvarchar(max)
set @ReverseFullName = reverse(@FullName)

declare @lengthOfFullName int
declare @endOfFirstName int
declare @beginningOfLastName int

set @lengthOfFullName = len(@FullName)
set @endOfFirstName = charindex(' ', @FullName)
set @beginningOfLastName = @lengthOfFullName - charindex(' ', @ReverseFullName) + 1

set @FirstName = case when @endOfFirstName <> 0 
                      then substring(@FullName, 1, @endOfFirstName - 1) 
                      else @FullName  -- no space at all: the whole string is the first name
                 end

set @MiddleName = case when (@endOfFirstName <> 0 and @beginningOfLastName <> 0 and @beginningOfLastName > @endOfFirstName)
                       then ltrim(rtrim(substring(@FullName, @endOfFirstName , @beginningOfLastName - @endOfFirstName))) 
                       else ''
                  end

-- guard on @endOfFirstName: with no space, the length below would be negative and substring would raise an error
set @LastName = case when @endOfFirstName <> 0 
                     then substring(@FullName, @beginningOfLastName + 1 , @lengthOfFullName - @beginningOfLastName)
                     else ''
                end

return

end

And here's me calling it.

DECLARE @FirstName nvarchar(255),
     @MiddleName nvarchar(255),
     @LastName nvarchar(255)

EXEC    [dbo].[import_ParseName]
     @FullName = N'Scott The Other Scott Kowalczyk',
     @FirstName = @FirstName OUTPUT,
     @MiddleName = @MiddleName OUTPUT,
     @LastName = @LastName OUTPUT

print   @FirstName 
print   @MiddleName
print   @LastName 

output:

Scott
The Other Scott
Kowalczyk
Even Mien
A: 

As everyone else says, you can't do this in a simple programmatic way.

Consider these examples:

  • President "George Herbert Walker Bush" (First Middle Middle Last)

  • Presidential assassin "John Wilkes Booth" (First Middle Last)

  • Guitarist "Eddie Van Halen" (First Last Last)

  • And his mom probably calls him Edward Lodewijk Van Halen (First Middle Last Last)

  • Famed castaway "Mary Ann Summers" (First First Last)

  • New Mexico GOP chairman "Fernando C de Baca" (First Last Last Last)

Andy Lester
A: 

We of course all understand that there's no perfect way to solve this problem, but some solutions can get you farther than others.

In particular, it's pretty easy to go beyond simple whitespace-splitters if you just have some lists of common prefixes (Mr, Dr, Mrs, etc.), infixes (von, de, del, etc.), suffixes (Jr, III, Sr, etc.) and so on. It's also helpful if you have some lists of common first names (in various languages/cultures, if your names are diverse) so that you can guess whether a word in the middle is likely to be part of the last name or not.
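
A minimal sketch of the list-based idea, in Python for illustration. The three word lists are tiny and purely illustrative (real lists would be much longer), and `parse_with_lists` is a hypothetical name. This version does not use a list of common first names:

```python
PREFIXES = {"mr", "mrs", "ms", "dr"}
INFIXES = {"von", "van", "de", "del", "der"}
SUFFIXES = {"jr", "sr", "ii", "iii"}

def _key(token):
    # Compare case-insensitively and ignore a trailing period ("Dr." -> "dr").
    return token.rstrip(".").lower()

def parse_with_lists(full_name):
    tokens = full_name.split()
    prefix = suffix = first = middle = last = None
    if tokens and _key(tokens[0]) in PREFIXES:
        prefix = tokens.pop(0)
    if tokens and _key(tokens[-1]) in SUFFIXES:
        suffix = tokens.pop()
    if tokens:
        first, rest = tokens[0], tokens[1:]
        if rest:
            # The last name is the final token plus any infixes directly before it.
            start = len(rest) - 1
            while start > 0 and _key(rest[start - 1]) in INFIXES:
                start -= 1
            middle = " ".join(rest[:start]) or None
            last = " ".join(rest[start:])
    return {"prefix": prefix, "first": first, "middle": middle,
            "last": last, "suffix": suffix}

print(parse_with_lists("Dr. Mario Luis de Luigi Jr."))
print(parse_with_lists("Jan van der Ploeg"))
```

Anything the lists do not cover (for example "Mary Ann Summers", where the first name has two words) still falls back to the plain first/middle/last split.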

BibTeX also implements some heuristics that get you part of the way there; they're encapsulated in the Text::BibTeX::Name perl module. Here's a quick code sample that does a reasonable job.

use Text::BibTeX;
use Text::BibTeX::Name;
$name = "Dr. Mario Luis de Luigi Jr.";
$name =~ s/^\s*([dm]rs?\.?|miss)\s+//i;  # \. matches a literal period after the title
$dr=$1;
$n=Text::BibTeX::Name->new($name);
print join("\t", $dr, map "@{[ $n->part($_) ]}", qw(first von last jr)), "\n";
Ken Williams
A: 

If you're trying to parse apart a human name in PHP, I recommend Keith Beckman's nameparse.php script.

Jonathon Hill