Antisamy is an HTML content filter meant for allowing an untrusted user to input a limited subset of ‘safe’ HTML. It is not an all-purpose input filter that can save you from having to think about string escaping and XSS issues.
You should use antisamy only to clean up content that will contain HTML that you wish to output verbatim on a page. Most user input is not HTML: when a user types a<b or c>d, they should usually get the literal less-than and greater-than characters and not a bold tag. To ensure this happens correctly, you must HTML-escape all text content that gets inserted into your page at the output stage. That step has nothing to do with antisamy.
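As a minimal sketch of escaping at the output stage, assuming a Python back end (the original does not name a server-side language) and a hypothetical render_comment helper:

```python
import html

def render_comment(user_text: str) -> str:
    # Escape when building the page, not when the input arrives,
    # so the stored value stays the text the user actually typed.
    return "<p>" + html.escape(user_text) + "</p>"

print(render_comment("a<b or c>d"))  # <p>a&lt;b or c&gt;d</p>
```

The browser then shows the literal characters a<b or c>d rather than interpreting a tag.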
1234%27%2Balert%2873918%29%2B%27
This looks nothing like a typical HTML injection attack. The only ‘special’ character it contains is an apostrophe, which isn't usually special in HTML, and can't practically be filtered out of input because users do generally need to use apostrophes for writing in English.
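To see what that input actually contains, it can be URL-decoded. This is only an illustration using Python's standard library:

```python
from urllib.parse import unquote

raw = "1234%27%2Balert%2873918%29%2B%27"
decoded = unquote(raw)
print(decoded)  # 1234'+alert(73918)+'

# None of the characters normally associated with HTML injection appear.
print(any(ch in decoded for ch in '<>&"'))  # False
```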
If this is causing script injection for your application, you've got bigger problems than anything antisamy can solve. If this is causing your page to pop up an alert() dialogue, you are probably using the value unescaped in a JavaScript string literal, for example something like:
<a href="..." onclick="callfunc('hello <%= somevar %>');">
Putting text content into JavaScript code as a string literal requires another form of escaping: one that turns the ' character (the %27 in the URL-encoded input) into a backslash-escaped \', and \ itself into \\ (as well as a few other replacements).
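A rough sketch of that kind of escaping for a single-quoted JavaScript literal follows. The function name is hypothetical and the replacement list is deliberately incomplete, so treat it as an illustration of the idea rather than something to deploy:

```python
def js_escape_single_quoted(text: str) -> str:
    # Backslash must be handled first, otherwise the backslashes
    # added by the later replacements would themselves be doubled.
    replacements = [
        ("\\", "\\\\"),
        ("'", "\\'"),
        ("\n", "\\n"),
        ("\r", "\\r"),
    ]
    for raw, escaped in replacements:
        text = text.replace(raw, escaped)
    return text

payload = "1234'+alert(73918)+'"
print("callfunc('hello " + js_escape_single_quoted(payload) + "');")
# callfunc('hello 1234\'+alert(73918)+\'');
```

With the apostrophes escaped, the payload stays inside the string literal instead of closing it and calling alert().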
The easy way to get values (strings or otherwise) from a server-side scripting language into a JavaScript literal is to use a standard JSON encoder.
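For instance, assuming Python on the server, the standard json module produces a complete, quoted literal, so no hand-written replacement list is needed:

```python
import json

somevar = "1234'+alert(73918)+'"  # the decoded attack string
js_literal = json.dumps("hello " + somevar)
print(js_literal)  # "hello 1234'+alert(73918)+'"

# The literal carries its own double quotes; do not wrap it in more quotes.
print("callfunc(" + js_literal + ");")
# callfunc("hello 1234'+alert(73918)+'");
```

The apostrophes are harmless here because the encoder delimits the string with double quotes, and it escapes any double quotes and backslashes in the value.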
However, in the above case, the JavaScript string literal is itself contained inside an HTML attribute, so you would have to HTML-encode the results of the JSON encoder. This is a bit ugly; it's best to avoid inline event handler attributes. Use external scripts and <script> elements instead, binding events from JS instead of HTML.
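To illustrate the two layers of encoding that an inline handler needs, here is a sketch assuming Python on the server. The callfunc name comes from the example above; everything else is illustrative:

```python
import html
import json

somevar = "1234'+alert(73918)+'"

# Layer 1: turn the text into a JavaScript string literal.
js_literal = json.dumps("hello " + somevar)

# Layer 2: the whole script sits inside an HTML attribute, so HTML-escape it.
attr_value = html.escape("callfunc(" + js_literal + ");", quote=True)

print('<a href="..." onclick="' + attr_value + '">')
# <a href="..." onclick="callfunc(&quot;hello 1234&#x27;+alert(73918)+&#x27;&quot;);">
```

The browser undoes the HTML layer first and the JavaScript parser then sees a well-formed string literal. Keeping the order straight is exactly the ugliness that binding events from an external script avoids.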
Even in a <script> block, where you don't generally need to HTML-encode, you have to beware of the string </script> (or, generally, anything beginning </, which can end the block). To avoid that sequence you should replace the < character with something else, e.g. \x3C. Some JSON encoders have an option to do this for you.
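A sketch of that replacement, assuming Python's json module and a hypothetical helper name. It uses \u003c, which means the same as \x3C inside a JavaScript string but also keeps the output valid JSON:

```python
import json

def json_for_script_block(value) -> str:
    # In JSON output a "<" can only occur inside a string literal,
    # so a plain text replacement on the encoded result is safe.
    return json.dumps(value).replace("<", "\\u003c")

hostile = "</script><script>alert(1)</script>"
encoded = json_for_script_block(hostile)
print("<script>var msg = " + encoded + ";</script>")
# <script>var msg = "\u003c/script>\u003cscript>alert(1)\u003c/script>";</script>
```

The JavaScript engine decodes \u003c back to < when it evaluates the literal, so the value the script sees is unchanged while the HTML parser never encounters a premature </script>.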
There are many other places where inserting content into a containing language requires special sorts of encoding. Each has its own rules. You can't avoid the difficulty of string encoding by using a general-purpose input filter. Some “anti-XSS” filters try, but they invariably fail miserably.