I have implemented a search engine in C for my HTML website. My entire web is programmed in C.

I understand that HTML input sanitization is necessary because an attacker can enter these two HTML snippets into my search page to trick it into downloading and displaying foreign images/scripts (XSS):

<img src="path-to-attack-site"/>
<script>...xss-code-here...</script>

Wouldn't these attacks be prevented simply by searching for '<' and '>' and stripping them from the search query? Wouldn't that render both snippets useless, since they would no longer be treated as HTML? I've seen HTML filtering that goes way beyond this, filtering out absolutely all JavaScript commands and HTML markup!

A: 

Encoding brackets is indeed sufficient in most cases to prevent XSS, as anything between tags will then display as plain text.

Jeff Meyers
+6  A: 

Input sanitisation is not inherently ‘necessary’.

It is a good idea to remove things like control characters that you never want in your input, and certainly for specific fields you'll want specific type-checking (so that, e.g., a phone number contains only digits).

But running escaping/stripping functions across all form input for the purpose of defeating cross-site-scripting attacks is absolutely the wrong thing to do. It is sadly common, but it is neither necessary nor in many cases sufficient to protect against XSS.

HTML-escaping is an output issue which must be tackled at the output stage: that is, usually at the point you are templating strings into the output HTML page. Escape < to &lt;, & to &amp;, and in attribute values escape the quote you're using as an attribute delimiter, and that's it. No HTML-injection is possible.
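To make the output-stage idea concrete for a C code base, here is a minimal sketch of an escaping helper. The name `html_escape`, its buffer-based signature and the `(size_t)-1` error convention are all assumptions made for illustration, not anything from the asker's code. It also escapes `>` and both quote characters, which is more than strictly required but lets the same function serve for text content and for quoted attribute values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: HTML-escape the NUL-terminated string `src` into
 * `dst`, which has room for `cap` bytes including the terminating NUL.
 * Returns the length of the escaped string, or (size_t)-1 if it does not fit. */
size_t html_escape(const char *src, char *dst, size_t cap)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);

    for (; *src != '\0'; src++) {
        char one[2] = { *src, '\0' };   /* the character itself, unchanged */
        const char *rep;
        size_t len;

        switch (*src) {
        case '&':  rep = "&amp;";  break;
        case '<':  rep = "&lt;";   break;
        case '>':  rep = "&gt;";   break;
        case '"':  rep = "&quot;"; break;  /* for double-quoted attributes */
        case '\'': rep = "&#39;";  break;  /* for single-quoted attributes */
        default:   rep = one;      break;
        }

        len = strlen(rep);
        if (n + len + 1 > cap)          /* keep one byte for the NUL */
            return (size_t)-1;
        memcpy(dst + n, rep, len);
        n += len;
    }

    if (n + 1 > cap)                    /* only reachable when cap == 0 */
        return (size_t)-1;
    dst[n] = '\0';
    return n;
}
```

The search page would keep the query as plain text everywhere else and call something like `html_escape(query, buf, sizeof buf)` only at the point where it prints the query back into the results page.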

If you try to HTML-escape or filter at the form input stage, you're going to have difficulty whenever you output data that has come from a different source, and you're going to be mangling user input that happens to include <, & and " characters.

And there are other forms of escaping. If you try to create an SQL query with the user value in, you need to do SQL string literal escaping at that point, which is completely different to HTML escaping. If you want to put a submitted value in a JavaScript string literal you would have to do JSON-style escaping, which is again completely different. If you wanted to put a value in a URL query string parameter you need URL-escaping, not HTML-escaping. The only sensible way to cope with this is to keep your strings as plain text and escape them only at the point you output them into a different context like HTML.
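To illustrate how different another context's escaping is, here is a rough sketch of URL-escaping (percent-encoding) a value for a query string parameter. The function name `url_escape`, its signature and its error convention are hypothetical, and the code assumes an ASCII-compatible character set. It leaves only letters, digits and `-`, `.`, `_`, `~` untouched and encodes every other byte as `%XX`.

```c
#include <assert.h>
#include <stddef.h>

/* Characters that may appear unencoded in a URL component
 * (assumes an ASCII-compatible execution character set). */
static int is_unreserved(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}

/* Hypothetical helper: percent-encode `src` into `dst` (capacity `cap`
 * bytes including the NUL). Returns the encoded length, or (size_t)-1
 * if the result does not fit. */
size_t url_escape(const char *src, char *dst, size_t cap)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;

    assert(src != NULL && dst != NULL);

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        size_t len = is_unreserved(c) ? 1 : 3;

        if (n + len + 1 > cap)          /* keep one byte for the NUL */
            return (size_t)-1;

        if (len == 1) {
            dst[n++] = (char)c;
        } else {
            dst[n++] = '%';
            dst[n++] = hex[c >> 4];     /* high nibble */
            dst[n++] = hex[c & 0x0F];   /* low nibble */
        }
    }

    if (n + 1 > cap)                    /* only reachable when cap == 0 */
        return (size_t)-1;
    dst[n] = '\0';
    return n;
}
```

The same input character ends up as `&lt;` in HTML and as `%3C` in a URL, which is why a single up-front filter cannot serve every output context.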

Wouldn't these attacks be prevented simply by searching for '<' and '>' and stripping them from the search query?

Well yes, if you also stripped ampersands and quotes. But then users wouldn't be able to use those characters in their content. Imagine us trying to have this conversation on SO without being able to use <, & or "! And if you wanted to strip out every character that might be special when used in some context (HTML, JavaScript, CSS...) you'd have to disallow almost all punctuation!

< is a valid character, which the user should be permitted to type, and which should come out on the page as a literal less-than sign.

My entire web is programmed in C.

I'm so sorry.

bobince
+1 for the last line.
Russell Dias
@bobince: You mention filtering double quotes. What about single quotes?
bobby
Well, in an HTML attribute value you have to escape whichever quote you've used as the delimiter: `<div title="He said &quot;That's OK&quot;">` and `<div title='He said "That&#39;s OK"'>` are equally valid, though the double-quote delimiter is by far the more commonly used, which is why I mentioned that in particular. To be safe, you can always escape both. Single quotes also have to be escaped (in a different way) when creating SQL queries, of course.
bobince
@Russell Dias: Last line? I had a web analytics counter installed (written in PHP) that slowed down the site considerably with only a few days' worth of data!
bobby
That would be an application problem, usually a poorly-designed database schema (and yes, many PHP apps really are very poorly designed). High-level scripting languages are used for the vast majority of web sites without speed being a big issue; if you're using C through CGI, the startup costs and inability to pool database connections and cache resources will cost much more than you gain by using the low-level language.
bobince
@bobince: Yes, they used a flat file (exec by PHP) rather than a database (exec by C/C++). Most web hosts install PHP as a FastCGI binary rather than as an Apache module for security reasons. You can get the best of both worlds by using PHP for the front end and C/C++ for the back end (which is what Yahoo does, I think).
bobby