I'm looking for a reliable way to return the first and last name of a person given the full name. So far, the best I could think of is the following regular expression:

$name = preg_replace('~\b(\p{L}+)\b.+\b(\p{L}+)\b~i', '$1 $2', $name);

The expected output should be something like this:

William -> William // Regex Fails
William Henry -> William Henry
William Henry Gates -> William Gates

I also want it to support accents, for instance "João".

EDIT: I understand that some names will not be identified properly, but that isn't a problem for me. This is going to be used on a local site where the last word is the last name (though it might not be the whole surname), and all I want is a quick way to say "Dear FIRST_NAME LAST_NAME". So all this discussion, while totally valid, is useless to me.

Can someone help me with this?

+4  A: 

Instead of a regex you might find it easier to do something like:

$parts = explode(" ", $name);
$first = $parts[0];
$last = "";
if (count($parts) > 1) {
    $last = $parts[count($parts) - 1];
}

You might want to replace multiple consecutive bits of whitespace with a single space first, so you don't get empty bits, and get rid of trailing/leading whitespace:

$name = preg_replace("/[ \t\r\n]+/", " ", trim($name));
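
Putting the two steps together, here is a minimal sketch. The function name `first_and_last` is made up for illustration, and it assumes `$name` is valid UTF-8. It swaps `explode` for `preg_split` with `PREG_SPLIT_NO_EMPTY`, so leading, trailing, or repeated whitespace never produces empty parts:

```php
<?php
// Hypothetical helper: returns [first, last]; last is "" for a single-word name.
function first_and_last(string $name): array
{
    // Split on any run of whitespace; NO_EMPTY drops the empty pieces that
    // leading, trailing, or repeated whitespace would otherwise create.
    $parts = preg_split('/\s+/u', $name, -1, PREG_SPLIT_NO_EMPTY);
    if ($parts === false || count($parts) === 0) {
        // Invalid UTF-8 or an empty/blank input.
        return ['', ''];
    }
    $first = $parts[0];
    $last = count($parts) > 1 ? $parts[count($parts) - 1] : '';
    return [$first, $last];
}
```

For example, `first_and_last("  William   Henry Gates ")` gives `['William', 'Gates']`, and `first_and_last('William')` gives `['William', '']`.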
Dominic Rodger
+1, regexes are probably the worst solution to this problem.
DisgruntledGoat
How does explode handle empty elements - i.e. does $name need to be trimmed first? Also, `[ \t\r\n]` is a rather long-winded way of saying `\s`.
Peter Boughton
http://codepad.org/QZ5tTQPQ
Nick Presta
+1  A: 

If you're defining first and last name as the text before the first space and after the last space, then just split the string on spaces and grab the first and last elements of the array.

However, depending on the context/scope of what you're doing, you may need to re-evaluate things - not all names around the world will meet this pattern.

Peter Boughton
+6  A: 

This might not be what you want to hear, but I don't think this problem is suited to a regular expression, since names are not regular. I don't think they are even context-sensitive or context-free. If anything, they are unrestricted (I would have to think that through more than I have before saying so for sure), and no regular expression engine can parse an unrestricted grammar.

Thomas Owens
Actually, I thought about it a little more...I don't think names are unrestricted. I don't think any formal language can be used on names.
Thomas Owens
+2  A: 

Depending on how clean your data is, I think you are going to have a tough time finding a single regex that does what you want. What different formats do you expect the names to be in? I've had to write similar code and there can be a lot of variations:

- first last
- last, first
- first middle last
- last, first middle

And then you have things like suffixes (Junior, Senior, III, etc.), prefixes (Mr., Mrs., etc.), and combined names (e.g. John and Mary Smith). As others have already mentioned, you also have to deal with multi-part last names (e.g. Victor de la Hoya).

I found I had to deal with all of those possibilities before I could reliably pull out the first and last names.
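
As a rough illustration of handling just the four formats listed above, here is a minimal sketch. The function name `split_name` is hypothetical, and it deliberately ignores suffixes, prefixes, combined names, and multi-part surnames, all of which would need extra rules:

```php
<?php
// Hypothetical helper: accepts "first [middle] last" or "last, first [middle]".
// Returns [first, last]; last is "" when only one word is given.
function split_name(string $name): array
{
    $name = trim($name);
    if (strpos($name, ',') !== false) {
        // "last, first middle": the surname is everything before the comma.
        [$last, $rest] = array_map('trim', explode(',', $name, 2));
        $given = preg_split('/\s+/', $rest, -1, PREG_SPLIT_NO_EMPTY);
        return [$given[0] ?? '', $last];
    }
    // "first middle last": take the first and last whitespace-separated words.
    $parts = preg_split('/\s+/', $name, -1, PREG_SPLIT_NO_EMPTY);
    $first = $parts[0] ?? '';
    $last = count($parts) > 1 ? end($parts) : '';
    return [$first, $last];
}
```

With this sketch, both `split_name('Gates, William Henry')` and `split_name('William Henry Gates')` return `['William', 'Gates']`.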

TLiebe
A: 

I think your best option is to simply treat everything after the first name as the surname i.e.

William Henry Gates
Forename: William
Surname: Henry Gates

It's the safest mechanism, as not everyone will enter their middle name anyway. You can't simply extract William, ignore Henry, and extract Gates, because for all you know Henry is part of the surname.
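
A minimal sketch of that approach, assuming the name has already been trimmed and uses single spaces (the variable names are illustrative). The third argument to `explode` caps the number of pieces, so everything after the first space stays together:

```php
<?php
$name = 'William Henry Gates';

// A limit of 2 splits at the first space only.
$parts = explode(' ', $name, 2);
$forename = $parts[0];
$surname = $parts[1] ?? ''; // empty when only one word was given
```

Here `$forename` is "William" and `$surname` is "Henry Gates".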

James
Taking middle names as the last name is not the safest mechanism. The safest mechanism, if the parts are needed, is to explicitly ask people to provide them in separate fields.
Peter Boughton
It's the safest mechanism...given the circumstances. I would imagine that if he had that option available, he wouldn't be in the situation he is.
James
+1  A: 

As is, you're requiring a last name -- which, of course, your first example doesn't have.

Use a non-capturing group, (?:...), and a 0-or-1 quantifier, ?, around the middle and last names as a whole to make them optional:

'~\b(\p{L}+)\b (?: .+\b(\p{L}+)\b )?~ix'  # x for spacing

This should allow the first name to be captured whether middle/last names are given or not.

$name = preg_replace('~\b(\p{L}+)\b(?:.+\b(\p{L}+)\b)?~i', '$1 $2', $name);
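
Two caveats, offered as a sketch rather than a correction. For a single-word name `$2` is empty, so the replacement leaves a trailing space. And without the `u` modifier PCRE works on bytes rather than UTF-8 characters, which can mangle names like "João". Assuming `$name` is valid UTF-8, a variant with `u` added and the result trimmed might look like this (the sample names are only for illustration):

```php
<?php
$names = ['William', 'William Henry', 'William Henry Gates', 'João Pedro Silva'];
$results = [];
foreach ($names as $name) {
    // u: treat pattern and subject as UTF-8; trim() removes the trailing
    // space left behind when the optional group (and so $2) did not match.
    $results[] = trim(preg_replace('~\b(\p{L}+)\b(?:.+\b(\p{L}+)\b)?~u', '$1 $2', $name));
}
```

For the four sample names this yields "William", "William Henry", "William Gates", and "João Silva".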
Jonathan Lonowski
Thanks, this was exactly what I was looking for. =)
Alix Axel