I'm sure it can be done fx. in PHP by validating forms
Not really. The input stage is entirely the wrong place to be addressing XSS issues.
If the user types, say, <script>alert(document.cookie)</script> into an input, there is nothing wrong with that in itself. I just did it in this message, and if StackOverflow didn't allow it we'd have great difficulty talking about JavaScript on the site! In most cases you want to allow any input(*), so that users can use a < character to literally mean a less-than sign.
The thing is, when you write some text into an HTML page, you must escape it correctly for the context it's going into. For PHP, that means using htmlspecialchars() at the output stage:
<p> Hello, <?php echo htmlspecialchars($name); ?>! </p>
[PHP hint: you can define yourself a function with a shorter name that does echo htmlspecialchars, since this is quite a lot of typing to do every time you want to put a variable into some HTML.]
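A minimal sketch of such a helper follows. The name h() is hypothetical, and returning the escaped string (rather than echoing it) is just one way to do it. The ENT_QUOTES flag and the explicit UTF-8 charset are assumptions about what the page needs.

```php
<?php
// Hypothetical short helper; the name h() is an assumption, not a built-in.
function h(string $text): string
{
    // ENT_QUOTES also escapes single quotes, so the result can be used in
    // element content and inside quoted attribute values alike.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

$name = '<script>alert(document.cookie)</script>';

// The markup characters arrive in the page as &lt; and &gt;, so the browser
// shows them as text instead of running them.
echo '<p> Hello, ', h($name), '! </p>', "\n";
```

In a template this shortens to <?php echo h($name); ?>, which is easier to apply consistently to every variable.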
This is necessary regardless of where the text comes from, whether it's from a user-submitted form or not. Whilst user-submitted data is the most dangerous place to forget your HTML-encoding, the point is really that you're taking a string in one format (plain text) and inserting it into a context in another format (HTML). Any time you throw text into a different context, you're going to need an encoding/escaping scheme appropriate to that context.
For example, if you insert text into a JavaScript string literal, you would have to escape the quote character, the backslash and newlines. If you insert text into a query component in a URL, you will need to convert most non-alphanumerics into %xx sequences. Every context has its own rules; you have to know which is the right function for each context in your chosen language/framework. You cannot solve these problems by mangling form submissions at the input stage, though many naïve PHP programmers try, which is why so many apps mess up your input in corner cases and still aren't secure.
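As a rough illustration of those two contexts in PHP, the sketch below uses json_encode() for the JavaScript string literal and rawurlencode() for the URL query component. These are one reasonable choice of functions, not the only one; the sample strings and the /search path are made up.

```php
<?php
$text = "It's \"quoted\"\\ and\nsplit </script> over lines";

// JavaScript string literal: json_encode() produces a quoted literal with
// backslashes and newlines escaped. The JSON_HEX_* flags also turn
// <, >, &, ' and " into \uXXXX escapes, which matters when the literal is
// written into an inline <script> block.
$jsLiteral = json_encode(
    $text,
    JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT
);

// URL query component: rawurlencode() converts every byte other than
// letters, digits and - _ . ~ into a %xx sequence.
$query = rawurlencode('fish & chips?'); // fish%20%26%20chips%3F

// The finished URL is itself plain text going into an HTML attribute,
// so it is HTML-escaped as well.
$href = htmlspecialchars('/search?q=' . $query, ENT_QUOTES, 'UTF-8');

echo "var message = $jsLiteral;\n";
echo "<a href=\"$href\">search</a>\n";
```

Note that the escaping is layered: the query value is URL-encoded first, and the resulting URL is then HTML-escaped, because each context it passes into has its own rules.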
(*: well, almost any. There's a reasonable argument for filtering out the ASCII control characters from submitted text. It's very unlikely that allowing them would do any good.
Plus of course you will have application-specific validations that you'll want to do, like making sure an e-mail field looks like an e-mail address or that numbers really are numeric. But this is not something that can be blanket-applied to all input to get you out of trouble.)
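For what that input-stage handling might look like, here is a hedged sketch. The helper name strip_control_chars(), the decision to keep tab, line feed and carriage return, and the sample field values are all assumptions for illustration.

```php
<?php
// Hypothetical helper: remove ASCII control characters from submitted text,
// keeping tab (0x09), line feed (0x0A) and carriage return (0x0D).
function strip_control_chars(string $text): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $text);
}

// Application-specific checks on individual fields (sample values are made up).
$email = 'someone@example.com';
$age   = '42';

// filter_var() returns false when the value fails validation.
$emailOk = filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
$ageOk   = filter_var($age, FILTER_VALIDATE_INT) !== false;
```

Neither step replaces escaping at output: a value that passes these checks can still contain <, & or quotes that matter once it is written into HTML.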