Why is filter_input() incomplete?

I am working a lot on a PHP-based CMS at the moment, and while I'm at it I would like to move all the handling and sanitation of user input to one central place. (At the moment, it's a $_REQUEST here, a $_GET there, and so on).

I like filter_input() very much and would like to use it for basic sanitation, but I'm unclear as to whether this function is really production ready. For example, the documentation names the following parameters for $type

INPUT_GET, INPUT_POST, INPUT_COOKIE, INPUT_SERVER, INPUT_ENV, INPUT_SESSION (not implemented yet) and INPUT_REQUEST (not implemented yet).

the function exists since 5.2.0, why are two crucial elements not implemented yet? If I want to fetch data from $_REQUEST, you have to use a workaround from the user contributed notes. Is there a special reason for this? Is this function still in some kind of beta? Is it trustworthy as the first call to handle incoming data?

Maybe somebody familiar with the PHP development process can shed some light on this.

I would like to move all the handling and sanitation of user input to one central place

Yes, how lovely that would be. It can't be done. That's not how text processing works.

If you're inserting text from one context into another you need to use the right escapes. (mysql_real_escape_string for MySQL string literals, htmlspecialchars for HTML content, urlencode for URL parameters, others for specific contexts). At the start of your script when you're filtering, you don't know where your input is going to end up, so you don't know how to escape it.

Maybe one input string is going both into the database (needs to be SQL-escaped) and directly onto the page (needs to be HTML-escaped). There's no one escape that covers both those cases. You can use both escapes one after the other, but then the value in the HTML will have weird backslashes appearing in it and the copy in the database will be full of ampersands. A few rounds of this misencoding and you get that situation where every time you edit something, long strings of \\\\\\\\\\\\\\\\\\\\ and &amp;amp;amp;amp;amp;amp;amp;amp;amp; come out.

The only way you can safely filter in one go at start time is by completely removing all characters that need to be escaped in any of the contexts you're going to be using them in. But that means no apostrophes or backslashes in your HTML, no ampersands or less-thans in your database, and probably a whole load of other URL-unfriendly punctuation has to go too. For a simple site that doesn't take arbitrary text you could maybe get away with that. But usually not.

So you can only escape on the fly when one type of text goes into another. The best strategy to avoid the problem is to avoid concatenating text into other contexts as much as much as you possibly can, for example by using parameterised queries instead of SQL string building, and either defining an echo(htmlspecialchars()) function with a nice short name to make it less work to type, or using an alternative templating system that HTML-escapes by default.

You do write awfully long answers (good explanation though).

C. Ross 2009-11-27 17:46:17

@bobince: I say it can be done to a degree if you 1.) know what you will need in the script, and 2.) mark the sanitized variables as what they are. I have had my share of \\\\\\\\\\\\\\'s to know what you're talking about. :) My main goal is to have one basic "security checkpoint" with a defined set of checks, instead of pulling stuff from the arrays all over the code.

Pekka 2009-11-27 17:49:56

A simple explanation is that filtering/sanitization is only part of the process your data needs to go through. Sanitized data still needs to be escaped. e.g. you would not stick an email address unquoted into a SQL query, no matter how valid it is.

Ben James 2009-11-27 17:56:03

@Ben James: Yes, absolutely. But I still don't see what's wrong with pulling incoming data in one place, and run a number of generic checks on it first. Say you have a "from" parameter that redirects to a different URL. Regardless of how it is output, you have, say, the common characteristic that you don't want it to point to an external URL. I have read the links in @stereofrog's reply and acknowledge the points made; still, checks like this are something I want to check at input time, and not at output time. Maybe "sanitation" was the wrong choice of word, in that case I apologize.

Pekka 2009-11-27 18:57:00

Well, you could make yourself a separate `$_POST_html_encoded` array, a `$_POST_sql_encoded` array and a `$_POST_url_encoded` array, then choose which one you needed at each point. But you'd still need to do the same for every other source of data, so each `$row` you grab from the database you have to make `$row_html_encoded`, `$row_sql_encoded` and so on. In the end this gets you nothing.

bobince 2009-11-27 20:16:27

@bobince: Did you read my comment above?

Pekka 2009-12-01 14:34:29

@Pekka: sure, you generally need to do application-specific, parameter-specific checks on input. But those kind of checks can't be automated or applied over everything at once as a cure-all; each has to be done individually. You *could* use `filter_input` to check each parameter one at a time, but it would be no different to just using whichever existing filter or escape you wanted for that particular field, and it wouldn't take care of escaping issues at all.

bobince 2009-12-01 15:03:00

@bobince: I am not planning to automate anything. I discovered filter_input() not long ago, and thought to myself: Hey, why not use this? And if it's just to attach custom checking functions to it like, say, local_url_only(). Why not use filter_input() as an agreed coding standard to pull input from the outside rather than fetching $_XYZ[]? Still, your point is well noted. Anyway, my initial question was why the implementation of a basic PHP function is so blatantly incomplete, without any explanation, for a number of versions now. Is there a history behind this? A conflict? Sloppiness?

Pekka 2009-12-01 15:45:11

Ah, okay, I guess I'm a little oversensitive to the word “sanitisation” and how it is usually applied to PHP. The filter functions are a messy bag of unrelated string-formatting functions, some of which are useful for validation but some of which are quite unsuitable for an input stage. I agree that it's a bit rough. Although INPUT_REQUEST wouldn't be needed much if you can use INPUT_POST|INPUT_GET|INPUT_COOKIE (who wants to lump cookies into form pars anyway?), it would seem trivial enough to implement. There doesn't seem to have been much development since filter moved from PECL to builtin.

bobince 2009-12-01 19:08:11

I saw your comment only now. Cheers bobince, I didn't know these functions came from PECL. That explains a lot.

Pekka 2009-12-07 08:48:04

ansaurus

tags:

views:

answers:

Why is filter_input() incomplete?

related questions