I've run into a bit of a problem with a regex I'm using for human names.

$rexName = "/^[a-z' -]+$/i";

Suppose a user with the name Jürgen wishes to register? Or Böb? That's pretty commonplace in Europe. Is there a special notation for this?

EDIT: I just threw the name Jürgen at a regex generator, and it splits the word at the letter ü...

http://www.txt2re.com/index.php3?s=J%FCrgen+Blalock&submit=Show+Matches

EDIT2: All right, since checking for such specific things is hard, why not use a regex that simply checks for illegal characters?

$rexSafety = "/^[^<,\"@/{}()*$%?=>:|;#]*$/i";

(now which ones of these can actually be used in any hacking attempt?)

For instance, this allows ' and - signs, yet you need a ; to make it work in SQL, and those will be stopped. Are there any other characters commonly used for HTML injection or SQL attacks that I'm missing?

+2  A: 

That's a problem with no easy general solution. The thing is that you really can't predict what characters a name could possibly contain. Probably the best solution is to define a negated character class that excludes the special characters you really don't want to end up in a name.

You can do this using:

$regexp = "/^[^<put unwanted characters here>]+$/";

sebasgo
So if I can't predict the characters, wouldn't it be better to use a regex that disallows things instead of one that allows things? I could just forbid the most common characters for SQL injection and XSS attacks, which would allow things like ü.
WebDevHobo
No, don't filter for SQL keywords and similar things. That's extremely bad coding style. Instead, escape data properly. Use mysql_real_escape_string() to prevent SQL injections and htmlentities() for XSS attacks.
sebasgo
Yes, sebasgo is right on. This is a waste of your time if you're trying to prevent SQL injections. Use functions designed for this purpose, don't reinvent the wheel :P
hobodave
"That's extremely bad coding style". That makes no sense. Adding a couple of characters to a blacklisting regex can't be called "extremely bad coding style". I can understand if you favor using the functions you mentioned, but don't go saying that people who do different have an "extremely bad coding style"
WebDevHobo
If you filter SQL keywords, the poor Bobby Tables will not be able to attend school.
Stefano Borini
Also, the mysql_real_escape_string() function simply turns potentially harmful code into a dud. It makes it harmless... but it's still entered into the database and I don't want that. Imagine that on a profile site, where a user profile displays a whole bunch of SQL code...
WebDevHobo
I read XKCD, Stefano. Also, I'm not filtering keywords at all, I'm filtering symbols. A hacker can put in as many keywords as he/she wants; if the ; symbol is found in the string, then it won't matter, it'll be refused.
WebDevHobo
+2  A: 

PHP’s PCRE implementation supports Unicode character properties that span a larger set of characters. So you could use a combination of \p{L} (letter characters), \p{P} (punctuation characters) and \p{Zs} (space separator characters):

/^[\p{L}\p{P}\p{Zs}]+$/u

But there might be characters that are not covered by these character categories while there might be some included that you don’t want to be allowed.
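
A minimal sketch of how such a pattern behaves, assuming the input is UTF-8 encoded. In that case the pattern needs the u modifier so that PCRE reads multi-byte characters as single code points. The helper name looksLikeName is hypothetical.

```php
<?php
// Hypothetical helper: true when $name consists only of letters,
// punctuation and space separators (Unicode categories L, P and Zs).
function looksLikeName(string $name): bool
{
    // The "u" modifier makes PCRE treat pattern and subject as UTF-8.
    return preg_match('/^[\p{L}\p{P}\p{Zs}]+$/u', $name) === 1;
}

var_dump(looksLikeName("J\u{00FC}rgen"));  // Jürgen: ü is a letter (\p{L})
var_dump(looksLikeName("O'Brien-Smith"));  // ' and - are punctuation (\p{P})
var_dump(looksLikeName('R2-D2'));          // digits are numbers (\p{N}): rejected
var_dump(looksLikeName('<b>Bob</b>'));     // < and > are symbols (\p{S}): rejected
```

Note that ; and " are punctuation too, so this pattern accepts them, which is exactly the kind of unwanted inclusion mentioned above.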

So I advise against using regular expressions on a datum with a range of values as vague as a real person’s name.


Edit: Now that you have edited your question, I see that you just want to prevent certain code injection attacks. You would do better to escape those characters than to reject them as a potential attack attempt.

Use mysql_real_escape_string or prepared statements for SQL queries, htmlspecialchars for HTML output and other appropriate functions for other languages.
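
As a rough sketch of that advice (not the asker's actual code): the first function stores a name through a PDO prepared statement, the second escapes it for HTML output. The users table, its name column and both function names are assumptions made for illustration.

```php
<?php
// Hypothetical: store a name using a prepared statement, so the value
// is sent separately from the SQL text and cannot alter the query.
function saveName(PDO $pdo, string $name): void
{
    $stmt = $pdo->prepare('INSERT INTO users (name) VALUES (:name)');
    $stmt->execute([':name' => $name]);
}

// Hypothetical: escape a name at output time, whatever it contains.
function renderName(string $name): string
{
    return htmlspecialchars($name, ENT_QUOTES, 'UTF-8');
}

echo renderName("O'Brien <script>"), "\n"; // O&#039;Brien &lt;script&gt;
```

With both in place, a name such as O'Brien is stored and displayed unchanged, and markup in a name is shown as text instead of being interpreted.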

Gumbo
Your second link is a 404
WebDevHobo
I don't only want to prevent code injection; I don't even want escaped code to ever enter the database. As I already stated: imagine a user-profile website (MySpace, for instance). Imagine coming across a profile riddled with SQL injections, all of them escaped... What the hell kind of service is that? Why would I allow hackers to fill my database with useless drivel like that, when the only thing they're trying to do is hack my website?
WebDevHobo
+4  A: 

I would really say: don't try to validate names. One day or another, your code will meet a name that it thinks is "wrong"... And how do you think someone will react when an application tells them "your name is not valid"?

Depending on what you really want to achieve, you might consider using some kind of blacklist or filter to exclude the "not-names" you have in mind: it may let some "bad names" pass but, at least, it shouldn't prevent anyone with a real name from using your application.

Here are a few examples of rules that come to mind:

  • no numbers
  • no special characters, like "~{()}@^$%?;:/*§£ø and probably some others
  • no more than 3 spaces?
  • none of "admin", "support", "moderator", "test", and a few other obvious non-names that people tend to use when they don't want to type in their real name...
    • (but if they don't want to give you their name, they still won't, even if you forbid them from typing random letters: they could just use a real name... which is not theirs)

Yes, this is not perfect, and yes, it will let some non-names pass... But it's probably way better for your application than telling someone "your name is wrong" (yes, I insist ^^ )
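
The rules listed above could be sketched roughly as follows. This is only an illustration under assumptions: the function name nameProblems and the exact wording of the messages are made up, the input is assumed to be UTF-8, and ø is deliberately left out of the forbidden set because it is an ordinary letter in Danish and Norwegian names.

```php
<?php
// Hypothetical helper: returns the list of rules that $name breaks
// (an empty array means no rule objected). Assumes UTF-8 input.
function nameProblems(string $name): array
{
    $problems = [];

    // Any Unicode number character.
    if (preg_match('/\p{N}/u', $name) === 1) {
        $problems[] = 'contains a number';
    }
    // Single-quoted string, so the backslashes reach PCRE untouched.
    // \x{A7} is the section sign, \x{A3} the pound sign.
    if (preg_match('/["~{}()@^$%?;:\/*\x{A7}\x{A3}]/u', $name) === 1) {
        $problems[] = 'contains a special character';
    }
    if (substr_count($name, ' ') > 3) {
        $problems[] = 'contains more than 3 spaces';
    }
    // Illustrative list only; extend it as needed.
    $reserved = ['admin', 'support', 'moderator', 'test'];
    if (in_array(strtolower(trim($name)), $reserved, true)) {
        $problems[] = 'is a reserved word';
    }

    return $problems;
}

var_dump(nameProblems("J\u{00FC}rgen Blalock")); // no rule objects
var_dump(nameProblems('ma{rtin'));               // flags the special character
var_dump(nameProblems('Admin'));                 // flags the reserved word
```

Returning the list of broken rules, instead of a plain true or false, leaves it to the caller to decide whether to reject the input or merely flag it for review.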


And, to answer a comment you left under another answer:

I could just forbid the most common characters for SQL injection and XSS attacks,

About SQL injection: you must escape your data before sending it to the database; and if you always escape that data (you should!), you don't have to care about what users may or may not input: since it is always escaped, there is no risk for you.

The same goes for XSS: as long as you always escape your data when outputting it (you should!), there is no risk of injection ;-)


EDIT: if you use that regex as it stands, it will not work.

The following code:

$rexSafety = "/^[^<,\"@/{}()*$%?=>:|;#]*$/i";
if (preg_match($rexSafety, 'martin')) {
    var_dump('bad name');
} else {
    var_dump('ok');
}

will get you at least a warning:

Warning: preg_match() [function.preg-match]: Unknown modifier '{'

You must escape at least some of those special characters, in particular the / that doubles as the pattern delimiter: the pattern ends at that unescaped /, so the { that follows is read as a modifier. I'll let you dig into the PCRE Patterns documentation for more information (there is really a lot to know about PCRE / regex, and I won't be able to explain it all).

If you actually want to check that none of those characters appears in a given piece of data, you might end up with something like this:

$rexSafety = "/[\^<,\"@\/\{\}\(\)\*\$%\?=>:\|;#]+/i";
if (preg_match($rexSafety, 'martin')) {
    var_dump('bad name');
} else {
    var_dump('ok');
}

(This is a quick and dirty proposition, which has to be refined!)

This one says "ok" (well, I definitely hope my own name is ok!)
And the same example with some special characters, like this:

$rexSafety = "/[\^<,\"@\/\{\}\(\)\*\$%\?=>:\|;#]+/i";
if (preg_match($rexSafety, 'ma{rtin')) {
    var_dump('bad name');
} else {
    var_dump('ok');
}

Will say "bad name"

But please note that I have not fully tested this, and it probably needs more work! Do not use this on your site unless you have tested it very carefully!


Also note that a single quote can be helpful when attempting an SQL injection... But it is probably a character that is legal in some names... So just excluding some characters might not be enough ;-)

Pascal MARTIN
Yes, it will be escaped... but still entered into the database. I wouldn't like it if there were a couple hundred profiles on my website displaying nothing but a bunch of SQL code...
WebDevHobo
In this case, it might be interesting to add words like "select", "update", "delete", "where", "order by" and so on to the blacklist of forbidden words; after all, it is almost certain that they are not used in names ;-) You might also want to ensure that a user cannot register too many times (a quite basic idea, not necessarily the best one, would be to limit the number of registrations that can come from a single IP address in one hour, for instance)
Pascal MARTIN
Updated the original post with rexSafety variable.
WebDevHobo
I've edited my answer to add some stuff. As a sidenote, a name like "hello ;-{}" seems, to my eyes, nicer than a name like "this is fucking shit" -- your idea would reject the first one, and let the second one pass ? (sorry about the bad words -- wanted to show some kind of "real" example ; note it doesn't reflect any opinion)
Pascal MARTIN
Perhaps a better question to ask myself is: which characters do hackers ALWAYS need? For instance, I can allow the single quote and minus sign, but I will forbid = @ and ; The idea being a string meant to get past the security, will never be a single character. So it's a process of elimination: what is commonplace in human names and what is not. I don't need to forbid the ' character, since it will always be in the company of a @ or = sign. That's not 100% true, but I hope you see what I'm getting at.
WebDevHobo
The only problem with not allowing symbols: http://en.wikipedia.org/wiki/Prince_%28musician%29
Thomas Owens
A: 

If you're trying to parse apart a human name in PHP, I recommend Keith Beckman's nameparse.php script.

Jonathon Hill