I'm building a new web app in a LAMP environment, and I'm wondering whether preg_match can be trusted for validating user input (plus prepared statements, of course) for all the text-based fields (i.e. not HTML fields: phone, name, surname, etc.).

For example, for a classic 'email field', if I check the input like this:

$email_pattern = "/^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)" .
    "|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}" .
    "|[0-9]{1,3})(\]?)$/";

$email = $_POST['email'];
if(preg_match($email_pattern, $email)){
    //go on, prepare stmt, execute, etc...
}else{
    //email not valid! do nothing except warn the user
}

can I sleep easy as far as SQL/XSS injection is concerned?

I write the regexps to be as restrictive as they can be.

EDIT: as already said, I already use prepared statements, and this behavior is just for text-based fields (like phone, email, name, surname, etc.), so nothing that is allowed to contain HTML (for HTML fields, I use HTML Purifier).

Actually, my mission is to let the input value pass only if it matches my regexp whitelist; otherwise, return it to the user.

P.S.: I'm looking for something without mysql_real_escape_string; the project will probably switch to PostgreSQL in the near future, so I need a validation method that is cross-database ;)

+1  A: 

You still want to escape the data before inserting it into a database. Although validating the user input is a smart thing to do, the best protection against SQL injection is prepared statements (which keep the data separate from the SQL text, so no manual escaping is needed) or escaping it using the database's native escaping functionality.

John Conde
+1  A: 

NO.

NOOOO.

NOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO.

DO. NOT. USE. REGEX. FOR. THIS. EVER.

http://stackoverflow.com/questions/45093/regex-to-detect-sql-injection

http://stackoverflow.com/questions/1812891/java-escape-string-to-prevent-sql-injection

Matt Ball
Because if you use regex for this, Bears will eat you.
Billy ONeal
Velociraptors too. http://xkcd.com/292/
Matt Ball
@Billy and demons will fly out of your nose.
Neil Aitken
He already said he was using prepared statements.
Adam Backstrom
@Adam: where did he say that?
Matt Ball
@bears: "im wondering if preg_match can be trusted for user's input validation (**+ prepared stmt, of course**)" ;) My bad, I wasn't clear enough
DaNieL
Ah, ok. In that case, my answer was a bit extreme.
Matt Ball
+1  A: 

There is the PHP function mysql_real_escape_string(), which I believe you should use before submitting to a MySQL database to be safe. (Also, it is easier to read.)

Fletcher Moore
+4  A: 

For SQL injection, you should always use proper escaping like mysql_real_escape_string. The best is to use prepared statements (or even an ORM) to prevent omissions. You already did those.

The rest depends on your application's logic. You may filter HTML along with validation because you need correct information, but I don't do validation to protect from XSS, I only do business validation*.

General rule is "filter/validate input, escape output". So I escape what I display (or transmit to third-party) to prevent HTML tags, not what I record.

* Still, a person's name or email address shouldn't contain < >

streetpc
As I wrote, I already use prepared statements; I don't see how this answer is useful... sorry.
DaNieL
My bad, sorry. A prepared statement should be enough by itself to prevent SQL injection.
streetpc
+5  A: 

Whether or not a regular expression suffices for filtering depends on the regular expression. If you're going to use the value in SQL statements, the regular expression must in some way disallow ' and ". If you want to use the value in HTML output and are afraid of XSS, you'll have to make sure your regex doesn't allow <, > and ".

Still, as has been repeatedly said, you do not want to rely on regular expressions, and please, for the love of $deity, don't! Use mysql_real_escape_string() or prepared statements for your SQL statements, and htmlspecialchars() for your values when printed in HTML context.

Pick the sanitising function according to its context. As a general rule of thumb, it knows better than you what is and what isn't dangerous.


Edit, to accommodate your edit:

Database

Prepared statements == mysql_real_escape_string() on every value to put in. Essentially the same thing in effect, short of having a performance boost in the prepared statements variant, and being unable to accidentally forget using the function on one of the values. Prepared statements are what secures you against SQL injection, rather than the regex, though. Your regex could be anything and it would make no difference to the prepared statement.

You cannot and should not try to use regexes to accommodate a 'cross-database' architecture. Again, typically the system knows better what is and isn't dangerous for it than you do. Prepared statements are good, and if those are compatible with the change, then you can sleep easy. Without regexes.

If they're not and you must, use an abstraction layer to your database, something like a custom $db->escape() which in your MySQL architecture maps to mysql_real_escape_string() and in your PostgreSQL architecture maps to a respective method for PostgreSQL (I don't know which that would be off-hand, sorry, I haven't worked with PostgreSQL).
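As a rough illustration of the cross-database point, here is a minimal sketch using PDO, where only the DSN would change between MySQL and PostgreSQL. The helper name save_email(), the users table, and the in-memory SQLite database are assumptions made purely to keep the example self-contained; they are not from the question.

```php
<?php
// Hypothetical helper: stores an already-validated email via a prepared statement.
// The table "users" and column "email" are assumptions for illustration.
function save_email(PDO $db, string $email): void
{
    $stmt = $db->prepare('INSERT INTO users (email) VALUES (:email)');
    $stmt->execute([':email' => $email]);
}

// Only the DSN would change between backends, e.g.
//   new PDO('mysql:host=localhost;dbname=app', $user, $pass);
//   new PDO('pgsql:host=localhost;dbname=app', $user, $pass);
// An in-memory SQLite database keeps this sketch self-contained.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE users (email TEXT)');

// A hostile-looking value is bound as plain data, not interpreted as SQL.
save_email($db, "x'); DROP TABLE users; --");
```

Note that the regex plays no part here: whatever the validation step lets through, the bound parameter is treated as data.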

HTML

HTML Purifier is a good way to sanitise your HTML output (provided you use it in whitelist mode, which is the setting it ships with), but you should only use it where you absolutely need to preserve HTML: calling purify() is quite costly, because it parses the whole input and rewrites it thoroughly using a powerful set of rules. So, if you don't need HTML to be preserved, you'll want to use htmlspecialchars(). But then, again, at this point your regular expressions would have nothing to do with your escaping, and could be anything.

Security sidenote

Actually, my mission is to let the input value pass only if it matches my regexp whitelist; otherwise, return it to the user.

This may not be true for your scenario, but just as general information: The philosophy of 'returning bad input back to the user' runs the risk of opening you up to reflected XSS attacks. The user is not always the attacker, so when returning things to the user, make sure you escape it all the same. Just something to keep in mind.
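To make the sidenote concrete, a small sketch of re-displaying rejected input in a form field. The helper h() and the sample value are hypothetical; in a real script the value would come from $_POST.

```php
<?php
// Hypothetical helper: escapes a value for an HTML attribute or text node.
function h(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

// Simulated rejected input; made up for illustration.
$rejected = '"><script>alert(1)</script>';

// The value is shown back to the user, but only after escaping.
$field = '<input type="text" name="name" value="' . h($rejected) . '">';
echo $field, "\n";
```

Without the h() call, the sample value would close the attribute and inject a script tag into the page.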

pinkgothic
`Pick the sanitising function according to its context`. Totally agree. I do `(int)$_POST['int-value'];` when I expect integers, and HTML Purifier when I need to store HTML. But I'm a little bit curious about htmlspecialchars(). Let's say someone tries an XSS attack in the 'name' field; will a regexp stop it if he inserts HTML, or not? This is my real question; I feel like I can't trust regexps 100%, but... why not?
DaNieL
Because regexes are tricky. You may not recognize a hole in a complex regex.
streetpc
@DaNiel: streetpc has it right. The regex may quite possibly suffice, but it's just not good practise. Heck, *even if it suffices now*, and you *know* it does, it might just take it being adjusted by someone else later on, thinking that it's meant as a sanity check rather than *escaping*, and maybe for some reason dangerous characters end up being allowed, because the boss said so. Regexes just aren't meant to semantically convey 'this is being escaped'. [BTW, I expanded my answer following your edit.]
pinkgothic
@pinkgothic: this is a good point that I didn't think about; one of my colleagues could change some regexp without knowing that it's used as a sanitization method.
DaNieL
+2  A: 

Validation is to do with making input data conform to the expected values for your particular application.

Injections are to do with taking a raw text string and putting it into a different context without suitable escaping.

They are two completely separate issues that need to be looked at separately, at different stages. Validation needs to be done when input is read (typically at the start of the script); escaping needs to be done at the instant you insert text into a context like an SQL string literal, HTML page, or any other context where some characters have out-of-band meanings.

You shouldn't conflate these two processes and you can't handle the two issues at the same time. The word ‘sanitization’ implies a mixture of both, and as such is immediately suspect in itself. Inputs should not be ‘sanitized’, they should be validated as appropriate for the application's specific needs. Later on, if they are dumped into an HTML page, they should be HTML-escaped on the way out.

It's a common mistake to run SQL- or HTML-escaping across all the user input at the start of the script. Even ‘security’-focused tutorials (written by fools) often advise doing this. The result is invariably a big mess — and sometimes still vulnerable too.

With the example of a phone number field, whilst ensuring that a string contains only numbers will certainly also guarantee that it could not be used for HTML-injection, that's a side-effect which you should not rely on. The input stage should only need to know about telephone numbers, and not which characters are special in HTML. The HTML template output stage should only know that it has a string (and thus should always call htmlspecialchars() on it), without having to have the knowledge that it contains only numbers.
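The separation of the two stages could be sketched roughly as follows. Both function names and the accepted phone format (digits, spaces, +, -, parentheses, 6 to 20 characters) are assumptions for illustration only.

```php
<?php
// Input stage: knows only what a phone number looks like, nothing about HTML.
// The accepted format here is an assumption, not a recommendation.
function is_valid_phone(string $input): bool
{
    // The D modifier stops "$" from matching before a trailing newline.
    return preg_match('/^[0-9 +()\-]{6,20}$/D', $input) === 1;
}

// Output stage: knows only that it has a string going into HTML,
// so it always escapes, whatever the input stage accepted.
function render_cell(string $text): string
{
    return '<td>' . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . '</td>';
}
```

If the phone format is later loosened, render_cell() still escapes, so the output stage never depends on what the validator happens to reject.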

Incidentally, that's a really bad e-mail validation regex. Regex isn't a great tool for e-mail validation anyway; to do it properly is absurdly difficult, but this one will reject a great many perfectly valid addresses, including any with + in the username, any in .museum or .travel or any of the IDNA domains. It's best to be liberal with e-mail addresses.
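A quick way to see the difference is to run the question's pattern next to PHP's built-in filter_var() with FILTER_VALIDATE_EMAIL. The sample addresses are made up, and filter_var() is not a perfect validator either; it is just more liberal than this pattern.

```php
<?php
// The pattern from the question, unchanged.
$email_pattern = '/^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)'
    . '|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}'
    . '|[0-9]{1,3})(\]?)$/';

// Sample addresses, made up for illustration.
$samples = ['user+tag@example.com', 'curator@example.museum', 'not an email'];

$results = [];
foreach ($samples as $address) {
    $regex_ok  = preg_match($email_pattern, $address) === 1;
    $filter_ok = filter_var($address, FILTER_VALIDATE_EMAIL) !== false;
    $results[$address] = [$regex_ok, $filter_ok];
    printf("%-24s regex=%s filter_var=%s\n", $address,
        var_export($regex_ok, true), var_export($filter_ok, true));
}
```

The first two addresses are rejected by the pattern but accepted by filter_var(); the third is rejected by both.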

bobince
+1  A: 

If you are good with regular expressions: yes. But reading your email validation regexp, I'd have to answer no.

The best is to use the filter functions to get the user input relatively safely, and to keep your PHP up to date in case something broken is found in these functions. Once you have your raw input, you have to add some things depending on what you do with the data: remove \n and \r for email and HTTP headers, remove HTML tags to display to users, use parameterized queries to use it with a database.
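The header case might look something like this minimal sketch. The helper name strip_newlines() and the sample subject line are hypothetical.

```php
<?php
// Hypothetical helper: removes CR and LF so a value cannot start a new header line.
function strip_newlines(string $value): string
{
    return str_replace(["\r", "\n"], '', $value);
}

// Made-up hostile input trying to smuggle in an extra mail header.
$subject = strip_newlines("Hello\r\nBcc: victim@example.com");
echo $subject, "\n";
```

The injected text stays in the subject as harmless characters instead of becoming a second header.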

Arkh
I hate that. Arkh points out something interesting here; can whoever downvoted say WHY they consider this answer incorrect or not useful?
DaNieL
I suppose it was for the first 2 sentences. I felt obliged to punish him for that judgement too. Your comment stopped me :P +1
naugtur