Hi everybody, here's my problem: in PHP, I want to check whether a user has entered a real name and surname by checking that they contain only letters (of any alphabet) and ' or -. I found a solution here (but I don't remember the link) for checking that a string contains only letters:

preg_match('/^[\p{L} ]+$/u',$name)

but I'd like to allow ' and - too. (The charset is UTF-8.) Can anyone help me, please?

+4  A: 

Looks like you just need to modify the regex: [\p{L}' -]+
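
As a rough sketch of how that class would slot into the check from the question (the function name looksLikeName is made up for illustration, and the "\u{...}" escapes assume PHP 7 or later):

```php
<?php
// Hypothetical helper; the pattern is the one suggested above, anchored
// with ^...$ and using the /u modifier as in the question.
function looksLikeName(string $name): bool
{
    // The hyphen comes last in the class so it is read literally,
    // not as the start of a range.
    return preg_match('/^[\p{L}\' -]+$/u', $name) === 1;
}

var_dump(looksLikeName("Jean-Luc O'Neill"));           // bool(true)
var_dump(looksLikeName("Zo\u{00EB} M\u{00FC}ller"));   // bool(true)
var_dump(looksLikeName("M@rk"));                       // bool(false)
```

Comparing against 1 matters because preg_match returns false when the pattern fails to compile or the input is not valid UTF-8.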

Tomer Gabel
There is no range from `'` (U+0027) to ` ` (U+0020).
Gumbo
Edit it to [\p{L}' -]+ (space and dash swapped); otherwise it's wrong, as Gumbo noted. By the way, how can you be sure that ', space and dash are the only non-letters allowed in names?
Johannes Weiß
Corrected, thanks. As for your question, I can't; it's a heuristic approach and simply depends on the specific requirements of the application (e.g. a purely American-targeted application probably has little need for additional character support).
Tomer Gabel
+4  A: 

(International) names can contain many characters: spaces, 's, dashes, normal letters, umlauts, accents, ...

EDIT: The point is: how can you be sure that all letters (of all languages), dash, ' and space are enough? Are there no names that contain a dot (what about "Dr. No"?), a colon or some other character?

EDIT2: Thanks to the user 'some', probably from Sweden (who left a comment), we now know that there is a Swedish name 'Andreas J:son Friberg'. Remember the colon!
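
If characters like these do need to be accepted, they can be added to the class. A minimal sketch, assuming the dot and the colon are the only extras wanted (that choice is an assumption for illustration, not a complete list):

```php
<?php
// Letters, apostrophe, space, dot, colon and hyphen (hyphen kept last
// so it is literal). The extra characters are an assumed selection.
$extended = '/^[\p{L}\' .:-]+$/u';
// The narrower class discussed in this thread, for comparison.
$basic    = '/^[\p{L}\' -]+$/u';

var_dump(preg_match($extended, 'Andreas J:son Friberg')); // int(1)
var_dump(preg_match($extended, 'Dr. No'));                // int(1)
var_dump(preg_match($basic, 'Andreas J:son Friberg'));    // int(0)
```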

Johannes Weiß
Umlauts and accents are both diacritics; they modify letters and do not appear by themselves, e.g. "é". The question correctly takes them into account when it says "letters (of any alphabet)". Admittedly, a Chinese name contains ideographs, not letters.
MSalters
\p{L} accepts all characters that are letters, in any language.
vartec
yes, but not -, ', ...
Johannes Weiß
[\p{L}' -]+ does it for all names I know, but how can you be sure your regex contains all characters for all names in the world?
Johannes Weiß
MSalters, you are correct! I didn't say anything that contradicts your comment. What I wanted to say is that all letters (including those with diacritics and Chinese characters), dash, ' and space might not be enough to match every possible name.
Johannes Weiß
I'm not sure it's enough either. Do you think it can't be? "Dr. No" is not a name, I think. Can you give an example of a name that contains other characters (anything besides letters of any alphabet, spaces, ' and -)?
unknown (yahoo), no, I can't provide a sample; I only wanted to say that some foreign names could possibly contain more than these characters.
Johannes Weiß
Yes, Johannes, you're right. I'll have to do some research on this.
It's not common, but some people in Sweden have a colon, like "Andreas J:son Friberg"
some
A long time ago (in a galaxy far, far away) a son's surname was his father's name with the suffix "son" (daughters had the suffix "dotter"). "Johan Svensson" literally means "Johan, the son of Sven". This system was abandoned several hundred years ago, but the surnames are still used. (cont)
some
Mostly the -son names, since the woman usually changed her surname when she got married. Some people want more than one surname, and create new combinations. The "J:son" in the example above is a short form of "Johansson", "Jansson" or some other name that starts with the character "J".
some
A: 

This should also do it

/[\w'-]+/i
kRON
+2  A: 

Depending on the character set you want to permit, you'll just need to make sure that the characters you want to support are inside the '[]' portion of the regex. Since the '-' character has special meaning in this context (it creates a range), it needs to be the first or last item in the list, or be escaped as \-.

The \p{L} means match any character with the property of being a letter. \w has a similar meaning, but also includes digits and the '_' character, which you probably don't want.

preg_match('/^[A-Za-z \'-]+$/i',$name);

This would match most common names, though if you want to support foreign character sets, you'll need a more exotic regex.
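
To make the differences concrete, here is a small sketch comparing the three kinds of class mentioned here. The test strings are made up for illustration, and the "\u{...}" escape assumes PHP 7 or later:

```php
<?php
// Illustrative patterns and inputs only.
$ascii   = '/^[A-Za-z \'-]+$/';  // ASCII letters, space, apostrophe, hyphen
$word    = '/^[\w \'-]+$/';      // same idea, using \w instead
$unicode = '/^[\p{L} \'-]+$/u';  // any Unicode letter, UTF-8 mode

// \w also admits digits and the underscore.
var_dump(preg_match($word, 'agent_007'));   // int(1)
var_dump(preg_match($ascii, 'agent_007'));  // int(0)

// The ASCII-only class rejects non-ASCII letters; \p{L} with /u accepts them.
$rene = "Ren\u{00E9}"; // "René", escaped so the file encoding does not matter
var_dump(preg_match($ascii, $rene));    // int(0)
var_dump(preg_match($unicode, $rene));  // int(1)

// A hyphen between two characters is a range; at the end it is literal.
var_dump(preg_match('/^[a-c]+$/', 'b'));  // int(1): range a..c
var_dump(preg_match('/^[ac-]+$/', 'b'));  // int(0): only a, c and -
```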

Joseph Tary
+5  A: 

A little off-topic, but what exactly is the point of validating names?

It's not to prevent fraud; if people are trying to give you a fake name, they can easily type a string of random letters.

It's not to prevent mistakes; typing a punctuation character is only one of the many mistakes you could make, and an unlikely one at that.

It's not to prevent code injection; you should be preventing that by properly encoding your outputs, regardless of what characters they contain.

So why do we all do it?
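
On the code-injection point, a minimal sketch of encoding at output time rather than filtering at input time. The variable names and the sample value are made up, and this only covers the HTML context (SQL needs parameterised queries instead):

```php
<?php
// Hypothetical user input; stored as-is, with no character filtering.
$name = "O'Neill <script>";

// Encode when writing into HTML. ENT_QUOTES also encodes single quotes,
// which matters if the value ends up inside an attribute.
$safe = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

echo '<p>Hello, ' . $safe . '</p>';
```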

JW
I just check to see there are no digits 0-9, not empty, too short or too long (for the db field).
OIS
So Jennifer 8. Lee is not welcome on your site?
Chuck
The example you provided is right, because I don't want anybody to give a name like "Jennifer 8 Lee", "M@rk And€r$0n" or stuff like that...
A: 

If the charset is UTF-8, then you have a problem: how will you check for Central and Eastern European Latin characters (diacritics), or for names in Cyrillic, Chinese or Japanese? That would be a hell of a regex.

dusoft
A: 

Note that the example you provided does not check to ensure that the user has both a surname and given names, though I would argue that that is how it should be. You shouldn't assume a person has more than one name. I am currently working on a PHP application which deals with people's names in context, and if I have discovered anything it's that you cannot make such assumptions :) Even many non-celebrities have just one name.

Using the Unicode categories as in \p{L} was a good idea, since people will obviously have all sorts of characters from other languages in their names. However, as well as \p{L} you will also have to take into account combining marks, i.e. accents, umlauts etc. that people add as extra characters.

So, maybe immediately after \p{L} I'd add \p{Mc}

I'd end up with

preg_match('/^[\pL\p{Mc} \'-]+$/u', $name)
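
One thing worth checking before settling on a category: Unicode splits combining marks into \p{Mn} (nonspacing, which is where the common combining accents such as U+0301 live), \p{Mc} (spacing combining) and \p{Me} (enclosing), with \p{M} covering all three. A rough sketch of the difference, using a made-up test name and assuming PHP 7 or later for the "\u{...}" escapes:

```php
<?php
$decomposed  = "Rene\u{0301}"; // "e" followed by U+0301 COMBINING ACUTE ACCENT
$precomposed = "Ren\u{00E9}";  // single code point U+00E9

$withMc = '/^[\pL\p{Mc} \'-]+$/u'; // spacing combining marks only
$withM  = '/^[\pL\p{M} \'-]+$/u';  // all mark categories

var_dump(preg_match($withMc, $precomposed)); // int(1)
var_dump(preg_match($withMc, $decomposed));  // int(0): U+0301 is Mn, not Mc
var_dump(preg_match($withM,  $decomposed));  // int(1)
```

Whether decomposed input ever reaches the check depends on the client, so normalising the string first may be an alternative to widening the class.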
thomasrutter