views:

33

answers:

3

I am always thinking about validation in any kind on the webpage (PHP or ASP, it doesn't matter), but never find a good and accurate answer.

For example, a I have some GET-Parameter, which defines a SQL query like DESC oder ASC. (SQL-Injection?)

Or I have a comment-function for user, where the data is also saved in a database.

Is it enought to check for HTML-tags inside the data? Should the validation done before adding it to the database or showing it on the page?

I am searching for the ToDo's which should be always performed with any data given from "outside".

Thanks.

+1  A: 

If you're using PDO, be sure to use prepared statements - these clean the incoming data automatically.

If using the mysql_* functions, run each variable through mysql_real_escape_string first.

You can also do validation such as making sure the variable is one of an acceptable range:

$allowed_values = array('name', 'date', 'last_login')
if(in_array($v, $allowed_values)) {
    // now we can use the variable
}
adam
short, easy to understand +1, thanks
cevik
A: 

You are talking about two kinds of data sanitation. One is about putting user-generated data in your database and the other is about putting user-generated data on your webpage. For the former you should follow adam's suggestions. For the later you should look into htmlspecialchars.

Do not mix these two as they do two completely different things. For that purpose sanitation should only take place at the last moment. Use adam's suggestion just before updating the database. Use htmlspecialchars just before echoing data. Do not use htmlspecialchars on data before adding it to the database.

You might also want to look around Stackoverflow, because this sort of question has been asked and answered countless times in the past.

Manos Dilaverakis
+1  A: 

Have a good idea of what you want from the user.

You want them to specify ascending/descending order? That's an enumeration (or a boolean), not part of an SQL query:

$query = "SELECT [...] ORDER BY field " . escape($_GET['sortOrder']); //wrong

This is wrong no matter how much you escape and sanitize their string, because this is not the way to validate an enumeration. Compare:

if ($_GET['sortOrder'] == 'desc') {
    $ascending = false;
} else {
    $ascending = true;
}

if ($ascending) {
    ...
} else {
    ...
}

...which does not warrant a discussion of string escaping or SQL injection because all you want from the user is a yes/no (or ascending/descending) answer.

You want them to enter a comment? Why disallow HTML tags? What if the user wants to enter HTML code?

Again, what you want from them is, say, "a text... any text with a maximum length of 1024 characters*." What does this have to do with SQL or injection? Nothing:

$text = $_POST['commentText'];

if (mb_strlen($text, ENCODING) <= 1024) {
    //valid!
}

The value in the database should reflect what the user entered verbatim; not translated, not escaped. Say you're stripping all HTML <tags> from the comment. What happens when you decide to send comments somewhere in JSON format? Do you strip JSON control characters as well? What about some other format? What happens if HTML introduces a tag called ":)"? Do you go around in your database stripping off smileys from all comments?

The answer is no, as you don't want HTML-safe, JSON-safe, some-weird-format-with-smileys-safe input from the user. You want text that is at maximum 1024 characters. Check for that. Store that.

Now, the displaying part is trickier. In order to display:

<b>I like HTML "tags"

in HTML, you need to write something like:

&lt;b&gt;I like HTML &quot;tags&quot;

In JSON, you would do:

{ "I like HTML \"tags\" }

That is why you should use your language facilities to escape the data when you're using it.

The same of course goes for SQL, which is why you should escape the data when using simple query functions like mysql_query() in PHP. (Parametrized queries, which you should really be using, on the other hand, need no escaping.)

Summary

Have a really good idea of what you want as the input, keeping in mind that you almost never need, say, "HTML-safe text." Validate against that. Escape when required, meaning escape HTML as you send to the browser, SQL as you send to the database, and so on.


*: You should also define what a "character" means here. UTF-8, for example, may use multiple bytes to encode a code point. Does "character" mean "byte" or "Unicode code point"?

aib
What about script-tags in a comment field with html support? Should I look for special tags, do it with RegExp?
cevik
With HTML support, the issue becomes one of validating the HTML. "XSS-safe, script-free HTML" should be the type of the user input for the argument in my answer. In that case, you could allow for a *very* basic subset of HTML (e.g. `<img src=... alt=... />` but not `<img alt=... src=.../>`) and validate using regex or simple string functions, or use some kind of markup as on this site. Regex would be nowhere near enough to parse real HTML (see this excellent answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)
aib
In that case I'd argue that you could HTML-escape the comments either before insertion to the database or just prior to displaying them, as both `<b>` and `<b>` are valid HTML. Note that if you escaped beforehand and dubbed the DB field "safe HTML", and decided to support the `<b>` tag later on, comments made before that time would not make use of the tag - which may or may not be what you want.
aib
What I'm trying to convey is that you should not store HTML-escaped comments in the DB *just because they'll be displayed on a web page* - you should keep the data as it is. When the data *is* HTML, however, things get a bit confusing. You just have to be absolutely certain of the type of data you're storing in the DB; then validate and convert before you insert, and escape and decorate before you display.
aib