ansaurus

Question

Regular expression for validating names and surnames?

Answer 1

+4 A:

I would think you would be better off excluding the characters you don't want with a regex. Trying to get every umlaut, accented e, hyphen, etc. will be pretty insane. Just exclude digits (but then what about a guy named "George Forman the 4th") and symbols you know you don't want, like @#$%^ or what have you. But even then, using a regex will only guarantee that the input matches the regex; it will not tell you that it is a valid name.

EDIT after clarifying that this is trying to prevent XSS: A regex on a name field is obviously not going to stop XSS on its own. However, this article has a section on filtering that is a starting point if you want to go that route.

http://tldp.org/HOWTO/Secure-Programs-HOWTO/cross-site-malicious-content.html

s/[\<\>\"\'\%\;\(\)\&\+]//g;

kscott 2009-05-20 16:09:13
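
For readers who don't use Perl, the substitution quoted in the answer above can be approximated in Python. This is only a sketch: the helper name strip_listed_chars is made up for illustration, and, as the answer itself says, deleting these characters does not make a value safe against XSS on its own.

```python
import re

# Same character class as the Perl substitution: < > " ' % ; ( ) & +
LISTED_CHARS = re.compile(r"""[<>"'%;()&+]""")


def strip_listed_chars(value: str) -> str:
    """Delete every listed character from value (hypothetical helper)."""
    return LISTED_CHARS.sub("", value)


if __name__ == "__main__":
    print(strip_listed_chars("<b>Bob</b>"))  # markup characters are removed
    print(strip_listed_chars("D'Addario"))   # a legitimate apostrophe is removed too
```

Note that the filter is lossy: a real name such as D'Addario comes out altered, which is part of why the later answers prefer escaping on output.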

What should I exclude, exactly?

Sklivvz 2009-05-20 16:11:05

Well beyond sanitizing the input, I don't see a reason to eliminate any characters. What are you trying to prevent?

kscott 2009-05-20 16:14:03

Any characters that you can be sure won't end up in a name. Since people really can be named anything, nothing is entirely safe. But I think the examples given by kscott (!@#$%^) are a good place to start. You could easily run a large name list through your expression when you're done and see what falls out (if anything). +1

Copas 2009-05-20 16:14:14

Sorry, are you guys familiar with the Unicode character set? There would probably be about as many things to include as to exclude: see http://unicode.org/charts/ and http://unicode.org/charts/symbols.html

Sklivvz 2009-05-20 16:17:02

!@#$%^ won't cut it for XSS for example... http://ha.ckers.org/xss.html

Sklivvz 2009-05-20 16:22:25

No regex is going to prevent a cross site scripting attack

kscott 2009-05-20 16:36:52

Did you actually read the article you posted to? I quote:

"If you're already validating your input for valid characters (and you generally should), this is easily done by simply omitting the special characters from the list of valid characters. Here's an example in Perl of a filter that only accepts legal characters, and since the filter doesn't accept any special characters other than the space, it's quite acceptable for use in areas such as a quoted attribute:"

# Accept only legal characters:
$summary =~ tr/A-Za-z0-9\ \.\://dc;

Except obviously this doesn't work with Unicode ranges...

Sklivvz 2009-05-20 16:45:46
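
To make the "doesn't work with Unicode ranges" objection concrete, here is a rough Python equivalent of the quoted Perl whitelist filter. It is a sketch, not the article's code: the helper name keep_ascii_whitelist is invented, and the character set is copied from the tr expression (ASCII letters, digits, space, full stop, colon).

```python
import re

# Complement of the allowed set from the quoted Perl filter:
# ASCII letters, digits, space, full stop and colon.
NOT_ALLOWED = re.compile(r"[^A-Za-z0-9 .:]")


def keep_ascii_whitelist(value: str) -> str:
    """Delete every character outside the ASCII whitelist (hypothetical helper)."""
    return NOT_ALLOWED.sub("", value)


if __name__ == "__main__":
    for sample in ["John Smith", "João", "山田", "<xss>"]:
        print(repr(sample), "->", repr(keep_ascii_whitelist(sample)))
```

An ASCII whitelist removes markup characters, but it also silently drops every non-ASCII letter, so names like João or 山田 are damaged or emptied.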

I think you're answering your own question, Sklivvz: you're not going to find a regex that covers all Unicode characters and prevents cross site scripting. If stopping XSS was as simple as finding a magic regex, a lot of us would be out of jobs.

kscott 2009-05-20 16:49:11

Am I? And I thought I only asked what part of the unicode character set should be allowed in validating a name... poor me.

Sklivvz 2009-05-20 16:50:29

Then you need to decide what you're trying to prevent. If it's XSS, you only need to stop malicious characters, and even then you need to be doing more. If it's stopping people from entering names you don't like, then you're SOL: you'll never get a regex that handles every name in every culture. Heck, you probably couldn't even get one that handled American Hippies.

kscott 2009-05-20 17:09:35

Answer 2

+7 A:

I actually wouldn't bother.

No matter what regex you come up with, I can find a name somewhere in the world that will break it.

That being said, you do need to sanitize input, to avoid the Little Bobby Tables problem.

chris 2009-05-20 16:12:24

Hehe, funny... Good luck though in sanitizing against XSS, see: http://ha.ckers.org/xss.html

Sklivvz 2009-05-20 16:13:56

Actually I'd allow Bobby to enter his name; I'd just make sure it was escaped before I sent it to the database. Similarly I'd allow Mr "><script>alert("XSS");//</script> to have his given name, and I'd escape it before sending it to the browser. I'd only sanitise the input if I thought my colleagues might screw up the escaping.

user9876 2009-05-20 16:16:34
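
The "store it as entered, escape it on the way out" idea from the comment above can be sketched in Python with the standard library's html.escape. The function name render_greeting and the surrounding markup are assumptions for illustration; the sketch only covers HTML element content, not other output contexts.

```python
import html


def render_greeting(name: str) -> str:
    """Build an HTML fragment, escaping the name for element content."""
    return "<p>Hello, " + html.escape(name) + "</p>"


if __name__ == "__main__":
    print(render_greeting("D'Addario"))
    print(render_greeting('"><script>alert("XSS");//</script>'))
```

The stored value is left untouched; only the copy written into the page is escaped, so the same name can still be escaped differently for another output format later.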

That's exactly my problem. I am certain they will screw up.

Sklivvz 2009-05-20 16:19:45

@Sklivvz: Then that's what you need to address. If they don't properly escape when inserting into SQL, any name with an apostrophe (which your original question already recognizes as necessary) opens you up to security vulnerabilities. Imagine trying to authenticate a user named "Foo'or True Or'foo": no "dangerous" characters, but there goes your login scheme.

Ben Blank 2009-05-20 16:28:30
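
The apostrophe problem described above can be reproduced with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: the users table, the two sample rows and the function names are invented for the demonstration, and SQLite stands in for whatever database the thread's author actually uses.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """In-memory database with two ordinary users (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def count_matches_unsafe(conn: sqlite3.Connection, name: str) -> int:
    """Vulnerable: the name is pasted into the SQL text."""
    sql = "SELECT COUNT(*) FROM users WHERE name = '" + name + "'"
    return conn.execute(sql).fetchone()[0]


def count_matches_safe(conn: sqlite3.Connection, name: str) -> int:
    """The name travels as a bound parameter and is never parsed as SQL."""
    sql = "SELECT COUNT(*) FROM users WHERE name = ?"
    return conn.execute(sql, (name,)).fetchone()[0]


if __name__ == "__main__":
    db = make_db()
    tricky = "Foo'or True Or'foo"
    print("concatenated:", count_matches_unsafe(db, tricky))
    print("parameterized:", count_matches_safe(db, tricky))
```

With string concatenation the name rewrites the WHERE clause and matches every row; with a bound parameter it is compared as plain text, so the same name can also be stored and looked up without any character filtering.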

I can't address that problem; I need to do what I can to avoid it. Is there a particular reason why I should allow dangerous content in my db?

Sklivvz 2009-05-20 16:49:20

If all you're doing is reading and writing to the db, then properly parameterizing queries should take care of the problem. However, if you're ever going to dynamically execute code from the db (such as with the exec() statement), then you have to be careful.

chris 2009-05-20 17:54:40

Answer 3

+4 A:

I would just allow everything (except an empty string) and assume the user knows what his name is.

There are 2 common cases:

1. You care that the name is accurate and are validating against a real paper passport or other identity document, or against a credit card.
2. You don't care that much and the user will be able to register as "Fred Smith" (or "Jane Doe") anyway.

In case (1), you can allow all characters because you're checking against a paper document.

In case (2), you may as well allow all characters because "123 456" is really no worse a pseudonym than "Abc Def".

user9876 2009-05-20 16:13:13

+1 — No regex in the world matches subversive intent.

Ben Blank 2009-05-20 16:22:04

+1 Using a regex will only guarantee that the input matches the regex, it will not tell you that it is a valid name

kscott 2009-05-20 16:23:45

Answer 4

A:

I don’t think that’s a good idea. Even if you find an appropriate regular expression (maybe using Unicode character properties), this wouldn’t prevent users from entering pseudo-names like John Doe, Max Mustermann (there even is a person with that name), Abcde Fghijk or Ababa Bebebe.

Gumbo 2009-05-20 16:13:55

I don't care whether it's fake or if someone wants to be called "stupid moron"... ;-) I need to sanitize for XSS (which is way harder than one could assume).

Sklivvz 2009-05-20 16:18:58

You just need to escape the contextual meta characters.

Gumbo 2009-05-20 16:23:21

in english? :-P

Sklivvz 2009-05-20 16:30:22

Gumbo 2009-05-20 18:14:15

I can't rely on all clients to do that -- I'm writing an API for a series of different clients, some of which are actually closed source.

Sklivvz 2009-05-20 18:35:03

It’s YOUR job to do that on the server side and not the client’s. Remember: Never trust user data!

Gumbo 2009-05-20 18:40:45

MY job is to ensure the data is the best possible. That does include the escaping, but also validating the names for "sane" values. I can't possibly validate names for existence, but for sure a name should never contain, say, mathematical symbols.

The escaping you suggest would still allow for XSS names -- think about an "id10t" that HTML-decodes the string before putting it in the page. Those kinds of "names" should be caught before going in the database.

I totally agree with you for other kinds of text where there are no such rules. In those cases I would do as you suggest.

Sklivvz 2009-05-21 06:22:30

Well, it seems that you didn’t understand what exactly XSS is or what its fundamental flaw is. It’s a change from one context, in which a certain value is considered safe, into another, in which the same value isn’t considered safe. And that change is initiated by the value itself, as it contains particular character sequences that mark the end of one context and the start of the other, just like the `"` marks the beginning and end of a string declaration. Now if you want to put a string into another string declaration, you need to escape those character sequences so that they are treated as literals.

Gumbo 2009-05-21 07:11:30

So it suffices to escape the language- and context-dependent meta characters (those with a special meaning in that language and context) so that they are treated as literals and not as meta characters.

Gumbo 2009-05-21 07:13:33
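
A small Python sketch may help show what "context-dependent" means here: the same name is escaped three different ways depending on where it is going. The sample name and variable names are invented, and the three standard-library functions are just one possible choice per context, not something prescribed in the thread.

```python
import html
import json
from urllib.parse import quote

# Illustrative value containing meta characters for several contexts.
name = 'Tom "T" O\'Brien & <Sons>'

as_html = html.escape(name)    # for HTML text or a quoted attribute value
as_url = quote(name, safe="")  # for a URL path segment or query value
as_json = json.dumps(name)     # for a JSON string literal

if __name__ == "__main__":
    print(as_html)
    print(as_url)
    print(as_json)
```

Each result is only safe in its own context. For instance, the JSON form still contains a raw `<`, so it would need further handling before being embedded in an HTML page.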

Answer 5

A:

Forget about regex. Actually, I'd even say forget about such validation: there are no rules for names.

If you really, really want to do something, it would be to check that the UnicodeCategory of each character is a space, a non-spacing mark (aka accent) or alphanumeric, in order to exclude very weird stuff. But even then, you'll probably cause more problems than you solve.

Serge - appTranslator 2009-05-20 16:15:23
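
The answer above refers to .NET's UnicodeCategory; a comparable check can be sketched in Python with unicodedata.category. The function name is hypothetical, and one assumption goes beyond the answer: all mark categories (M*) are accepted rather than only non-spacing marks, since several scripts need spacing combining marks.

```python
import unicodedata


def has_only_name_categories(value: str) -> bool:
    """True if value is non-empty and every character is a letter (L*),
    a combining mark (M*), a number (N*) or a space separator (Zs)."""
    if not value:
        return False
    for ch in value:
        category = unicodedata.category(ch)
        if category[0] not in "LMN" and category != "Zs":
            return False
    return True


if __name__ == "__main__":
    for sample in ["João", "山田", "Jennifer 8 Lee", "D'Addario", "<xss>"]:
        print(repr(sample), has_only_name_categories(sample))
```

Even this permissive rule rejects D'Addario, because the apostrophe is punctuation, which illustrates the answer's warning that such checks can do more harm than good.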

Answer 6

A:

I doubt that this is feasible. There are just too many Unicode symbols to exclude all unwanted ones (and who will tell you which Chinese symbols to exclude?), and there are surely too many valid symbols to include them all (and you will have the Chinese symbols problem again).

I would not put any constraints on a user name - it may even contain numbers; think of aristocratic names.

Daniel Brückner 2009-05-20 16:15:57

Answer 7

+2 A:

Here you go: /.+/. That validates all known names assuming your regex engine can handle UNICODE characters.

Chas. Owens 2009-05-20 16:16:58

+1 For excluding new line character sequences.

Gumbo 2009-05-20 16:18:16

@Gumbo I actually worried about that. There are some screwed up people out there. I can see someone changing his or her name to "foo\nbar" and writing his or her first name on two lines.

Chas. Owens 2009-05-20 16:39:15
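
The newline remark can be checked in Python's re module, where, as in most engines, `.` does not match a newline by default. This is a sketch of Python's behaviour only; the helper names are invented, and other engines or flags (such as a "dot matches all" mode) may behave differently.

```python
import re

PATTERN = re.compile(r".+")


def matches_anywhere(value: str) -> bool:
    """Unanchored use, like /.+/: at least one non-newline character exists."""
    return PATTERN.search(value) is not None


def matches_whole(value: str) -> bool:
    """Anchored use: the entire value must consist of non-newline characters."""
    return PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["山田", "foo\nbar", "\n", ""]:
        print(repr(sample), matches_anywhere(sample), matches_whole(sample))
```

Used unanchored, the pattern only rejects empty or newline-only input; a two-line name is refused only when the whole string is required to match.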

Answer 8

A:

BTW, do you plan to only permit the Latin alphabet, or do you also plan to try to validate Chinese, Arabic, Hindi, etc.?

As others have said, don't even try to do this. Step back and ask yourself what you are actually trying to accomplish. Then try to accomplish it without making any assumptions about what people's names are, or what they mean.

John Saunders 2009-05-20 16:18:10

I need Chinese and Japanese at the very least. Why shouldn't I even TRY?

Sklivvz 2009-05-20 16:48:02

Try to do what? Do you know the rules for naming in those languages? Do you know how to distinguish between a first name and last name in those languages? Don't parse the names at all - just accept that people know their names.

John Saunders 2009-05-20 16:56:53

Because validating a name is not how you prevent cross site scripting. You allow the users to put whatever they want in the field, since names are crazy and there are a lot of Unicode characters in the world; then you treat whatever anyone puts in that field like it's radioactive.

kscott 2009-05-20 16:57:46

Answer 9

A:

As many have said above, there are no rules for names, and chris said he could find a name somewhere in the world that would break the regex. Take Jennifer 8 Lee for example. The best way to go is just to sanitize the input.

Alexander Kahoun 2009-05-20 16:20:26

how? I thought that was my question in the first place.

Sklivvz 2009-05-20 16:47:04

I just read one of your previous comments, is your goal to prevent XSS?

Alexander Kahoun 2009-05-20 17:30:15

If so check out the answer to this question: http://stackoverflow.com/questions/24723/best-regex-to-catch-xss-cross-site-scripting-attack-in-java hopefully it's useful for you.

Alexander Kahoun 2009-05-20 17:33:04

Answer 10

A:

It's a very difficult problem to validate something like a name due to all the corner cases possible.

Corner Cases

Anything anything here

Sanitize the inputs and let them enter whatever they want for a name, because deciding what is a valid name and what is not is probably way outside the scope of whatever you're doing, given that the range of potential strange (and legal) names is nearly infinite.

If they want to call themselves Tricyclopltz^2-Glockenschpiel, that's their problem, not yours.

Trampas Kirk 2009-05-20 17:35:55

Answer 11

A:

I'll try to give a proper answer myself:

The only punctuation marks that should be allowed in a name are the full stop, apostrophe and hyphen. I haven't seen any other in the list of corner cases.

Regarding numbers, there's only one case with an 8. I think I can safely disallow that.

Regarding letters, any letter is valid.

I also want to include space.

This would sum up to this regex:

^[\p{L} \.'\-]+$

This presents one problem, i.e. the apostrophe can be used as an attack vector. It should be encoded.

So the validation code should be something like this (untested):

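var name = nameParam.Trim();
// Verbatim string (@"...") so the regex backslashes are not read as C# escape sequences;
// the apostrophe is included in the class, as in the regex above.
if (!Regex.IsMatch(name, @"^[\p{L} \.'\-]+$"))
    throw new ArgumentException("nameParam");
name = name.Replace("'", "&#39;");  //&apos; does not work in IE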

Can anyone think of a reason why a name should not pass this test, or of an XSS or SQL injection that could pass?

complete tested solution

using System;
using System.Text.RegularExpressions;

namespace test
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            // Sample inputs: names in several scripts, followed by three attack-like strings.
            var names = new string[]{"Hello World",
                "John",
                "João",
                "タロウ",
                "やまだ",
                "山田",
                "先生",
                "мыхаыл",
                "Θεοκλεια",
                "आकाङ्क्षा",
                "علاء الدين",
                "אַבְרָהָם",
                "മലയാളം",
                "상",
                "D'Addario",
                "John-Doe",
                "P.A.M.",
                "' --",
                "<xss>",
                "\""
            };
            foreach (var nameParam in names)
            {
                Console.Write(nameParam + " ");
                var name = nameParam.Trim();
                // Allow letters (\p{L}), combining marks (\p{M}), apostrophe, space, full stop and hyphen.
                if (!Regex.IsMatch(name, @"^[\p{L}\p{M}' \.\-]+$"))
                {
                    Console.WriteLine("fail");
                    continue;
                }
                // Encode the apostrophe as an HTML entity before printing.
                name = name.Replace("'", "&#39;");
                Console.WriteLine(name);
            }
        }
    }
}

Sklivvz 2009-05-20 19:03:21

Sorry, you're still going to leave valid names out in the cold. I strongly suggest you read up on diacritics in Arabic, especially those that are separate Unicode characters but combine with letters to change them. Will you be disallowing things like "John W. Saunders, 3rd"? I hope not. It's just a much wider world out there than you seem to realize, and your simplistic, Western-oriented rules will simply not work in general.

John Saunders 2009-05-21 01:30:51

Hi John, the regex does support diacritics (Arabic is also in the test cases) with the \p{M}. Moreover, I am only validating names, i.e. in your example those would be "John W." (or "John" and "W.") and "Saunders". "," is not part of the name and "3rd" is a suffix.

Sklivvz 2009-05-21 06:11:40

You expect the users to enter FirstName, LastName, Suffix???? Or will you also have Prefix, MiddleName1, MiddleName2 .... There's another Question about names that discussed these issues extensively.

Osama ALASSIRY 2009-05-24 00:15:44
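
The objections in these comments can be tried against a Python approximation of the Answer 11 rule. Python's re module has no \p{L} or \p{M}, so this sketch substitutes unicodedata categories; the function name is invented, and edge cases may differ from .NET (for example because of different Unicode data versions).

```python
import unicodedata

# Literal characters the Answer 11 regex allows besides letters and marks.
EXTRA_ALLOWED = set("' .-")


def passes_answer11_rule(raw: str) -> bool:
    """Approximate ^[\\p{L}\\p{M}' \\.\\-]+$ applied to the trimmed input."""
    name = raw.strip()
    if not name:
        return False
    return all(
        ch in EXTRA_ALLOWED or unicodedata.category(ch)[0] in "LM"
        for ch in name
    )


if __name__ == "__main__":
    samples = ["D'Addario", "Jennifer 8. Lee", "John W. Saunders, 3rd", "' --", "<xss>"]
    for sample in samples:
        print(repr(sample), passes_answer11_rule(sample))
```

Two things stand out: names with digits or commas are rejected, and the string "' --" is accepted because every character in it is individually allowed. A character rule of this kind therefore cannot replace parameterized queries and output escaping.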
