tags:

views:

3740

answers:

11
+2  Q: 

Regex for names.

Hi there.

Just starting to explore the 'wonders' of regex. Being someone who learns from trial and error, I'm really struggling because my trials are throwing up a disproportionate amount of errors... My experiments are in PHP using ereg().

Anyway. I work with first and last names separately but for now using the same regex. So far I have:

^[A-Z][a-zA-Z]+$

Any length string that starts with a capital and has only letters (capital or not) for the rest. But where I fall apart is dealing with the special situations that can pretty much occur anywhere.

  • Hyphenated Names (Worthington-Smythe)
  • Names with Apostrophes (D'Angelo)
  • Names with Spaces (Van der Humpton) - capitals in the middle which may or may not be required is way beyond my interest at this stage.
  • Joint Names (Ben & Jerry)

Maybe there's some other way a name can be that I'm not thinking of, but I suspect if I can get my head around this, I can add to it. I'm pretty sure there will be instances where more than one of these situations comes up in one name.

So, I think the bottom line is to have my regex also accept a space, hyphens, ampersands and apostrophes - but not at the start or end of the name to be technically correct.
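For reference, a minimal sketch of what the starting pattern accepts and rejects. It uses Python's re module rather than PHP's ereg() purely for illustration; the sample names are the ones listed above, and BASIC_NAME is a made-up variable name.

```python
import re

# The starting pattern from the question: one capital, then one or more letters.
BASIC_NAME = re.compile(r"^[A-Z][a-zA-Z]+$")

samples = ["Smith", "Worthington-Smythe", "D'Angelo", "Van der Humpton", "Ben & Jerry"]

for name in samples:
    verdict = "ok" if BASIC_NAME.match(name) else "rejected"
    print(f"{name!r}: {verdict}")
```

Only the plain single-word name gets through, which is the gap the rest of the thread tries to close.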

A: 

If you add spaces, then "He went to the market on Sunday" would be a valid name.

I don't think you can do this with a regex. You cannot easily detect names in a chunk of text with one; you would need a dictionary of approved names and search based on that, and any name not on the list wouldn't be detected.

Osama ALASSIRY
Oh man, where's the name change form - I'm totally changing my name to "He went to the market on Sunday".
Paul Tomblin
You can't pull names out of a body of text, but you could potentially do a match to see if a given string is a 'valid' name. Why you would bother in production is beyond me, but this isn't production, this is learning regex.
Matthew Scharley
Right, my attempt is not to find a name in a sentence or paragraph or whatever, but check for some semblance of normality.
Humpton
A: 

Give up. Every rule you can think of has exceptions in some culture or other. Even if that "culture" is geeks who like to legally change their names to "37eet".

Paul Tomblin
+9  A: 
  • Hyphenated Names (Worthington-Smythe)

Add a - into the second character class. The easiest way to do that is to add it at the start so that it can't possibly be interpreted as a range modifier (as in a-z).

^[A-Z][-a-zA-Z]+$
  • Names with Apostrophes (D'Angelo)

A naive way of doing this would be as above, giving:

^[A-Z][-'a-zA-Z]+$

Don't forget you may need to escape it inside the string! A 'better' way, given your example might be:

^[A-Z]'?[-a-zA-Z]+$

Which will allow a possible single apostrophe in the second position.
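A quick side-by-side of the two apostrophe variants, sketched in Python's re module for illustration only (NAIVE and SECOND_POS are made-up names, and the test strings other than the two from the question are invented):

```python
import re

# Hyphen or apostrophe allowed anywhere after the first letter.
NAIVE = re.compile(r"^[A-Z][-'a-zA-Z]+$")
# Optional apostrophe only in second position; hyphens anywhere after that.
SECOND_POS = re.compile(r"^[A-Z]'?[-a-zA-Z]+$")

for name in ["D'Angelo", "Worthington-Smythe", "Sm'ith", "Smith'"]:
    print(name, bool(NAIVE.match(name)), bool(SECOND_POS.match(name)))
```

Note that both variants still accept a trailing hyphen such as "Smith-", which the question wanted to rule out.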

  • Names with Spaces (Van der Humpton) - capitals in the middle which may or may not be required is way beyond my interest at this stage.

Here I'd be tempted to just do our naive way again:

^[A-Z]'?[- a-zA-Z]+$

A potentially better way might be:

^[A-Z]'?[-a-zA-Z]+( [a-zA-Z]+)*$

Which looks for extra words at the end. This probably isn't a good idea if you're trying to match names in a body of extra text, but then again, the original wouldn't have done that well either.
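A sketch of the "extra words at the end" idea in Python's re module. It assumes every word should be one or more characters long, so the + quantifiers are written out explicitly; WORDS is a made-up name.

```python
import re

# First word: capital, optional apostrophe, then letters or hyphens.
# Each further word: a single space followed by one or more letters.
WORDS = re.compile(r"^[A-Z]'?[-a-zA-Z]+( [a-zA-Z]+)*$")

for name in ["Van der Humpton", "D'Angelo", "Van  der", "Van "]:
    print(repr(name), bool(WORDS.match(name)))
```

Because a space is only allowed as a word separator, doubled and trailing spaces are rejected, which the single naive character class cannot do.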

  • Joint Names (Ben & Jerry)

At this point you're not looking at single names anymore?

Anyway, as you can see, regexes have a habit of growing very quickly...

Matthew Scharley
Humpton
This doesn't handle international names. One of the comments below pointed out the use of \p{L} but you can read a lot more about unicode character classes at http://www.regular-expressions.info/unicode.html
Kimball Robinson
+1  A: 
^[A-Z][a-zA-Z '&-]*[A-Za-z]$

Will accept anything that starts with an uppercase letter, followed by zero or more letters, spaces, hyphens, ampersands or apostrophes, and ends with a letter.
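To see how this behaves on the cases raised in the thread, here is a small check in Python's re module (for illustration only; LOOSE_NAME is a made-up name):

```python
import re

# Capital first, any mix of letters / space / ' / & / - in the middle, letter last.
LOOSE_NAME = re.compile(r"^[A-Z][a-zA-Z '&-]*[A-Za-z]$")

samples = [
    "Worthington-Smythe",
    "D'Angelo",
    "Van der Humpton",
    "Ben & Jerry",
    "Smith-",
    "He went to the market on Sunday",
    "J\u00e9r\u00e9mie",  # Jérémie, with non-ASCII letters
]

for name in samples:
    print(repr(name), bool(LOOSE_NAME.match(name)))
```

All four special cases from the question pass and the trailing hyphen is rejected, but the sentence from the first answer also passes and the accented name does not.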

Robert Gamble
This does not account for international characters.
Kimball Robinson
+4  A: 

Basically, I agree with Paul... You will always find exceptions, like di Caprio, DeVil, or such.

Remarks on your message: in PHP, ereg is generally seen as obsolete (slow, incomplete) in favor of preg (PCRE regexes).
And you should try a regex tester, like the powerful Regex Coach: they are great for quickly testing REs against arbitrary strings.

If you really need to solve your problem and aren't satisfied with the above answers, just ask, and I will give it a go.

PhiLho
Firstly, I'll add exploring preg to my list. Then, I'll investigate a tester. And, I totally accept that people like di Caprio will mess up my first musings... This does have a real use, but mostly it's a learning experience. What appeared here in minutes has given me a lot to go on.
Humpton
+1  A: 

See this question for more "name-detection" related stuff.

http://stackoverflow.com/questions/256729/regex-to-match-a-maximum-of-4-spaces

Basically, your problem is that there are effectively no characters in existence that can't form part of a legal name string.

And that is while you are still limiting yourself to words without ä ü æ ß and other similar non-strictly-ASCII characters.

Get yourself a copy of the UTF-32 character table and realise how many thousands of valid characters there are that your simple regex would miss.

Kent Fredric
+3  A: 

I don't really have a whole lot to add to a regex that takes care of names because there are already some good suggestions here, but if you want a few resources for learning more about regular expressions, you should check out:

VirtuosiMedia
+3  A: 

I second the 'give up' advice. Even if you consider numbers, hyphens, apostrophes and such, something like [a-zA-Z] still wouldn't catch international names (for example, those containing šđčćž, or the Cyrillic alphabet, or Chinese characters...)

But... why are you even trying to verify names? What errors are you trying to catch? Don't you think people know how to write their name better than you? ;) Seriously, the only thing you can do by trying to verify names is to irritate people with unusual names.

Domchi
+3  A: 

While I agree with the answers saying you basically can't do this with regex, I will point out that some of the objections (internationalized characters) can be resolved by using UTF strings and the \p{L} character class (matches a Unicode "letter").
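A rough sketch of the same idea in Python. Python's built-in re module has no \p{L}, so this uses [^\W\d_] (a word character that is neither a digit nor an underscore) as an approximate stand-in; it is not an exact equivalent. The overall pattern shape (letters with single separators between them) and the names LETTER and UNICODE_NAME are assumptions made for the example.

```python
import re

# Approximation of "any Unicode letter": a word character that is not a digit or "_".
LETTER = r"[^\W\d_]"

# One letter, then any number of letters, each optionally preceded by a single
# space, apostrophe or hyphen. Separators can never appear first or last.
UNICODE_NAME = re.compile(rf"^{LETTER}(?:[ '-]?{LETTER})*$")

samples = [
    "J\u00e9r\u00e9mie O'Co-nor",                         # Jérémie O'Co-nor
    "\u0412\u043b\u0430\u0434\u0438\u043c\u0438\u0440",  # a Cyrillic name
    "\u674e\u5c0f\u9f99",                                 # a Chinese name
    "Smith-",
    "R2D2",
]

for name in samples:
    print(repr(name), bool(UNICODE_NAME.match(name)))
```

This is still only a sanity check: names written with combining marks, for example, would likely be rejected by this stand-in.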

eyelidlessness
You can read more about unicode and regular expressions at http://www.regular-expressions.info/unicode.html
Kimball Robinson
+2  A: 

.+

   

Kevin
+1  A: 

This regex is perfect for me.

^([ \u00c0-\u01ffa-zA-Z'\-])+$

It works fine in php environments using preg_match(), but doesn't work everywhere.

It matches Jérémie O'Co-nor so I think it matches all UTF-8 names.
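A quick check of how far that range reaches, sketched in Python's re module, which interprets the \uXXXX escapes itself (other engines may need a different syntax such as \x{00c0}). LATIN_RANGE is a made-up name and the extra test strings are invented.

```python
import re

# ASCII letters, space, apostrophe, hyphen, plus code points U+00C0 to U+01FF.
LATIN_RANGE = re.compile(r"^([ \u00c0-\u01ffa-zA-Z'\-])+$")

samples = [
    "J\u00e9r\u00e9mie O'Co-nor",                         # Jérémie O'Co-nor
    "\u0412\u043b\u0430\u0434\u0438\u043c\u0438\u0440",  # a Cyrillic name
    "\u674e",                                             # a Chinese family name
    "\u00d7",                                             # the multiplication sign
]

for name in samples:
    print(repr(name), bool(LATIN_RANGE.match(name)))
```

The range covers accented Latin letters only, so Cyrillic and Chinese names fall outside it, while a few non-letters inside the range (such as the multiplication sign at U+00D7) are accepted.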

Daan