Although this seems like a trivial question, I am quite sure it is not :)

I need to validate names and surnames of people from all over the world. How can I do that with a regular expression? If it were only English ones I think that this would cut it:

^[a-z '-]+$

However, I need to support also these cases:

  • other punctuation symbols as they might be used in different countries (no idea which, but maybe you do!)
  • different Unicode letter sets (accented letters, Greek, Japanese, Chinese, and so on)
  • no numbers or symbols or unnecessary punctuation or runes, etc.

Is there a standard way of validating these fields I can implement to make sure that our website visitors have a great experience and can actually use their name when registering?

I would be looking for something similar to the many "email address" regexes that you can find on google.


For the sake of clarity, I don't need one single regex for the "whole" name. I would expect users to be able to split their name in the two main constituents according to their customs, and not to use suffixes and titles -- which could be contained in other fields if need be.

The main purpose of the question is to validate against XSS and SQL-injection (yes, I already use stored procedures, but I need to future- and idiot-proof the data).

The way any XSS filter will work is by only allowing what is strictly necessary -- not by disallowing known XSS vectors (i.e. disallowing "script", "<", etc...). To get an idea of the incredible variety of attacks that can be used, take a look here: http://ha.ckers.org/xss.html.

Sorry for not mentioning this before, and thus making the question a bit more mysterious, but I didn't want to read 30 answers that boil down to "disallow the < or > and you are safe!".


See here for a good starting point on Unicode character classes in C# Regexes -- which of these are strictly necessary for writing a name? I honestly have no idea of which, but possibly the collective mind of stackoverflow can help?

(I am prepared to force people like Jennifer 8 Lee to write their name in letters ;-)


So, I did "bother" to do it myself, because I think nobody else even tried. Guess what? Apparently I did find a proper answer, posted below! It wasn't that hard.

Can you help me find a valid, existing name or an XSS vector that can break that validation?

+4  A: 

I would think you would be better off excluding the characters you don't want with a regex. Trying to get every umlaut, accented e, hyphen, etc. will be pretty insane. Just exclude digits (but then what about a guy named "George Forman the 4th"?) and symbols you know you don't want, like @#$%^ or what have you. But even then, using a regex will only guarantee that the input matches the regex; it will not tell you that it is a valid name.

EDIT after clarifying that this is trying to prevent XSS: A regex on a name field is obviously not going to stop XSS on its own. However, this article has a section on filtering that is a starting point if you want to go that route.

http://tldp.org/HOWTO/Secure-Programs-HOWTO/cross-site-malicious-content.html

s/[\<\>\"\'\%\;\(\)\&\+]//g;
kscott
What should I exclude, exactly?
Sklivvz
Well beyond sanitizing the input, I don't see a reason to eliminate any characters. What are you trying to prevent?
kscott
Any characters that you can be sure won't end up in a name. Since people really can be named anything, nothing is safe to some extent. But I think the examples given by kscott, !@#$%^, are a good place to start. You could easily run a large name list through your expression when you're done and see what falls out (if anything). +1
Copas
Sorry, are you guys familiar with the Unicode character set? There would probably be the same amount of things to include or exclude: see http://unicode.org/charts/ and http://unicode.org/charts/symbols.html
Sklivvz
!@#$%^ won't cut it for XSS for example... http://ha.ckers.org/xss.html
Sklivvz
No regex is going to prevent a cross site scripting attack
kscott
Did you actually read the article you linked to? I quote: "If you're already validating your input for valid characters (and you generally should), this is easily done by simply omitting the special characters from the list of valid characters. Here's an example in Perl of a filter that only accepts legal characters, and since the filter doesn't accept any special characters other than the space, it's quite acceptable for use in areas such as a quoted attribute: # Accept only legal characters: $summary =~ tr/A-Za-z0-9\ \.\://dc;" Except obviously this doesn't work with Unicode ranges...
Sklivvz
I think you're answering your own question, Sklivvz: you're not going to find a regex that covers all Unicode characters and prevents cross-site scripting. If stopping XSS was as simple as finding a magic regex, a lot of us would be out of jobs.
kscott
Am I? And I thought I only asked what part of the unicode character set should be allowed in validating a name... poor me.
Sklivvz
Then you need to decide what you're trying to prevent. If it's XSS, you only need to stop malicious characters, and even then you need to be doing more. If it's stopping people from entering names you don't like, then you're SOL: you'll never get a regex that handles every name in every culture. Heck, you probably couldn't even get one that handled American hippies.
kscott
+7  A: 

I actually wouldn't bother.

No matter what regex you come up with, I can find a name somewhere in the world that will break it.

That being said, you do need to sanitize input, to avoid the Little Bobby Tables problem.

chris
Hehe, funny... Good luck though in sanitizing against XSS, see: http://ha.ckers.org/xss.html
Sklivvz
Actually I'd allow Bobby to enter his name; I'd just make sure it was escaped before I sent it to the database. Similarly I'd allow Mr "><script>alert("XSS");//</script> to have his given name, and I'd escape it before sending it to the browser. I'd only sanitise the input if I thought my colleagues might screw up the escaping.
user9876
That's exactly my problem. I am certain they will screw up.
Sklivvz
@Sklivvz: Then that's what you need to address. If they don't properly escape when inserting into SQL, any name with an apostrophe (which your original question already recognizes as necessary) opens you up to security vulnerabilities. Imagine trying to authenticate a user named "Foo'or True Or'foo": no "dangerous" characters, but there goes your login scheme.
Ben Blank
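To make the apostrophe point above concrete, here is a minimal sketch of what happens when that name is concatenated into a query string. The `Users` table, the `Name` column and the `BuildUnsafe` helper are hypothetical, introduced only for illustration; nothing here touches a real database.

```csharp
using System;

static class LoginQueryDemo
{
    // Hypothetical query builder that concatenates user input: the unsafe pattern.
    public static string BuildUnsafe(string name)
    {
        return "SELECT * FROM Users WHERE Name = '" + name + "'";
    }

    public static void Main()
    {
        // The name contains only letters, spaces and apostrophes,
        // yet the apostrophes close and reopen the SQL string literal.
        Console.WriteLine(BuildUnsafe("Foo'or True Or'foo"));
    }
}
```

The printed query no longer compares `Name` against a single literal, which is why a character whitelist that admits the apostrophe cannot stand in for escaping or parameterized commands.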
I can't address that problem, I need to do what I can to avoid. Is there a particular reason why I should allow dangerous content in my db?
Sklivvz
If all you're doing is reading and writing to the db, then properly parameterizing queries should take care of the problem. However, if you're ever going to dynamically execute code from the db, then you have to be careful (such as with the exec() statement).
chris
+4  A: 

I would just allow everything (except an empty string) and assume the user knows what his name is.

There are 2 common cases:

  1. You care that the name is accurate and are validating against a real paper passport or other identity document, or against a credit card.
  2. You don't care that much and the user will be able to register as "Fred Smith" (or "Jane Doe") anyway.

In case (1), you can allow all characters because you're checking against a paper document.

In case (2), you may as well allow all characters because "123 456" is really no worse a pseudonym than "Abc Def".

user9876
+1 — No regex in the world matches subversive intent.
Ben Blank
+1 Using a regex will only guarantee that the input matches the regex, it will not tell you that it is a valid name
kscott
A: 

I don’t think that’s a good idea. Even if you find an appropriate regular expression (maybe using Unicode character properties), this wouldn’t prevent users from entering pseudo-names like John Doe, Max Mustermann (there even is a person with that name), Abcde Fghijk or Ababa Bebebe.

Gumbo
I don't care whether it's fake or if someone wants to be called "stupid moron"... ;-) I need to sanitize for XSS (which is way harder than one could assume).
Sklivvz
You just need to escape the contextual meta characters.
Gumbo
in english? :-P
Sklivvz
Gumbo
I can't rely on all clients to do that -- I'm writing an API for a series of different clients, some of which are actually closed source.
Sklivvz
It’s YOUR job to do that on the server side and not the client’s. Remember: Never trust user data!
Gumbo
MY job is to ensure the data is the best possible. That does include the escaping, but also validating the names for "sane" values. I can't possibly validate names for existence, but for sure a name should never contain, say, mathematical symbols. The escaping you suggest would still allow for XSS names -- think about an "id10t" that HTML-decodes the string before putting it in the page. Those kinds of "names" should be caught before going in the database. I totally agree with you for other kinds of text where there are no such rules. In those cases I would do as you suggest.
Sklivvz
Well, it seems that you didn't understand what exactly XSS is or what its fundamental flaw is. It's a change from one context, in which a certain value is considered safe, into another, in which the same value isn't. And that change is initiated by the value itself, as it contains particular character sequences that mark the end of one context and the start of the other. Just like the `"` marks the end/beginning of a string declaration. Now if you want to put a string into another string declaration, you need to escape those character sequences so that they are treated as literals.
Gumbo
So it suffices to escape the language- and context-dependent meta characters (those with a special meaning in that language and context) so that they are treated as literals and not as meta characters.
Gumbo
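As an illustration of escaping for one specific context, here is a minimal sketch that encodes a stored name for HTML output using `System.Net.WebUtility.HtmlEncode`. The `ForHtml` wrapper is a hypothetical name, and this covers the HTML text and quoted-attribute context only; JavaScript, URL and SQL contexts each need their own escaping.

```csharp
using System;
using System.Net;

static class OutputEscapingDemo
{
    // Escape at output time for HTML; the stored name itself stays unmodified.
    public static string ForHtml(string storedName)
    {
        return WebUtility.HtmlEncode(storedName);
    }

    public static void Main()
    {
        // The apostrophe and the markup characters are turned into entities,
        // so the browser renders them as text instead of interpreting them.
        Console.WriteLine(ForHtml("D'Addario"));
        Console.WriteLine(ForHtml("\"><script>alert(\"XSS\");//</script>"));
    }
}
```

Under this approach the database keeps the name as the user typed it, and each consumer encodes for the context it writes into.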
A: 

Forget about regex. Actually, I'd even say forget about such validation: there are no rules for names.

If you really, really want to do something, it would be to check that each character's UnicodeCategory is a space, a non-spacing mark (aka accent) or alphanumeric, in order to exclude very weird stuff. But even then, you'll probably cause more problems than you solve.

Serge - appTranslator
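A minimal sketch of the category check described above, assuming "alphanumerical" means the Unicode letter categories plus decimal digits. The `NameCategoryCheck` class and its method names are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Linq;

static class NameCategoryCheck
{
    // True for letters, decimal digits, non-spacing marks and space separators.
    public static bool IsAllowed(char c)
    {
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.TitlecaseLetter:
            case UnicodeCategory.ModifierLetter:
            case UnicodeCategory.OtherLetter:
            case UnicodeCategory.NonSpacingMark:
            case UnicodeCategory.DecimalDigitNumber:
            case UnicodeCategory.SpaceSeparator:
                return true;
            default:
                return false;
        }
    }

    // Works per UTF-16 code unit, so characters outside the Basic Multilingual
    // Plane arrive as surrogates and are rejected by this simple version.
    public static bool LooksLikeName(string s)
    {
        return !string.IsNullOrWhiteSpace(s) && s.All(IsAllowed);
    }

    public static void Main()
    {
        foreach (var n in new[] { "João", "山田", "Jennifer 8 Lee", "D'Addario", "<xss>" })
            Console.WriteLine(n + ": " + LooksLikeName(n));
    }
}
```

Note that this literal reading rejects "D'Addario", because the apostrophe is punctuation, which is a small example of the harm the answer warns about.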
A: 

I doubt that this is feasible: there are just too many Unicode symbols to exclude all the unwanted ones (and who will tell you which Chinese symbols to exclude?), and there are surely too many valid symbols to include them all (and you will have the Chinese symbols problem again).

I would not put any constraints on a user name - it may even contain numbers; think of aristocratic names.

Daniel Brückner
+2  A: 

Here you go: /.+/. That validates all known names, assuming your regex engine can handle Unicode characters.

Chas. Owens
+1 For excluding new line character sequences.
Gumbo
@Gumbo I actually worried about that. There are some screwed up people out there. I can see someone changing his or her name to "foo\nbar" and writing his or her first name on two lines.
Chas. Owens
A: 

BTW, do you plan to only permit the Latin alphabet, or do you also plan to try to validate Chinese, Arabic, Hindi, etc.?

As others have said, don't even try to do this. Step back and ask yourself what you are actually trying to accomplish. Then try to accomplish it without making any assumptions about what people's names are, or what they mean.

John Saunders
I need Chinese and Japanese at the very least. Why shouldn't I even TRY?
Sklivvz
Try to do what? Do you know the rules for naming in those languages? Do you know how to distinguish between a first name and last name in those languages? Don't parse the names at all - just accept that people know their names.
John Saunders
Because validating a name is not how you prevent cross-site scripting. You allow the users to put whatever they want in the field, since names are crazy and there are a lot of Unicode characters in the world; then you treat whatever anyone puts in that field like it's radioactive.
kscott
A: 

As many have said above there are no rules for names and chris said he could find a name in the world that could break the regex. Take Jennifer 8 Lee for example. The best way to go is just to sanitize the input.

Alexander Kahoun
how? I thought that was my question in the first place.
Sklivvz
I just read one of your previous comments, is your goal to prevent XSS?
Alexander Kahoun
If so check out the answer to this question: http://stackoverflow.com/questions/24723/best-regex-to-catch-xss-cross-site-scripting-attack-in-java hopefully it's useful for you.
Alexander Kahoun
A: 

It's a very difficult problem to validate something like a name due to all the corner cases possible.

Corner Cases

Sanitize the inputs and let them enter whatever they want for a name, because deciding what is a valid name and what is not is probably way outside the scope of whatever you're doing, given that the range of potential strange (and legal) names is nearly infinite.

If they want to call themselves Tricyclopltz^2-Glockenschpiel, that's their problem, not yours.

Trampas Kirk
A: 

I'll try to give a proper answer myself:

The only punctuation marks that should be allowed in a name are full stop, apostrophe and hyphen. I haven't seen any other case in the list of corner cases.

Regarding numbers, there's only one case with an 8. I think I can safely disallow that.

Regarding letters, any letter is valid.

I also want to include space.

This would sum up to this regex:

^[\p{L} \.'\-]+$

This presents one problem, i.e. the apostrophe can be used as an attack vector. It should be encoded.

So the validation code should be something like this (untested):

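var name = nameParam.Trim();
// Verbatim string (@) so that \p{L} reaches the regex engine unescaped;
// the apostrophe is allowed here and HTML-encoded below.
if (!Regex.IsMatch(name, @"^[\p{L} \.'\-]+$"))
    throw new ArgumentException("Invalid name", "nameParam");
name = name.Replace("'", "&#39;");  //&apos; does not work in IE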

Can anyone think of a reason why a name should not pass this test, or an XSS or SQL injection that could pass?


complete tested solution

using System;
using System.Text.RegularExpressions;

namespace test
{
    class MainClass
    {
     public static void Main(string[] args)
     {
      var names = new string[]{"Hello World", 
       "John",
       "João",
       "タロウ",
       "やまだ",
       "山田",
       "先生",
       "мыхаыл",
       "Θεοκλεια",
       "आकाङ्क्षा",
       "علاء الدين",
       "אַבְרָהָם",
       "മലയാളം",
       "상",
       "D'Addario",
       "John-Doe",
       "P.A.M.",
       "' --",
       "<xss>",
       "\""
      };
      foreach (var nameParam in names)
      {
       Console.Write(nameParam+" ");
       var name = nameParam.Trim();
       if (!Regex.IsMatch(name, @"^[\p{L}\p{M}' \.\-]+$"))
       {
        Console.WriteLine("fail");
        continue;
       }
       name = name.Replace("'", "&#39;");
       Console.WriteLine(name);
      }
     }
    }
}
Sklivvz
Sorry, you're still going to leave valid names out in the cold. I strongly suggest you read up on diacritics in Arabic, especially those that are separate Unicode characters but combine with letters to change them. Will you be disallowing things like "John W. Saunders, 3rd"? I hope not. It's just a much wider world out there than you seem to realize, and your simplistic, Western-oriented rules will simply not work in general.
John Saunders
Hi John, the regex does support diacritics (Arabic is also in the test cases) with the \p{M}. Moreover, I am only validating names, i.e. in your example those would be "John W." (or "John" and "W.") and "Saunders". "," is not part of the name and "3rd" is a suffix.
Sklivvz
You expect the users to enter FirstName, LastName, Suffix???? Or will you also have Prefix, MiddleName1, MiddleName2 .... There's another Question about names that discussed these issues extensively.
Osama ALASSIRY