tags:

views: 4811

answers: 7

Jeff actually posted about this in Sanitize HTML. But his example is in C#, and I'm more interested in a Java version. Does anyone have a better version for Java? Is his example good enough that I could just convert it directly from C# to Java?

[Update] I have put a bounty on this question because SO wasn't as popular as it is today when I asked the question (*). As with anything related to security, the more people who look into it, the better!

(*) In fact, I think it was still in closed beta.

+4  A: 

The regex shown in your example should work regardless of language.

So is it the regex you want, or the Java code to put this logic around the regex?

DevelopingChris
A: 

The biggest problem with using Jeff's code is the @ (C#'s verbatim string prefix), which isn't available in Java.

If I needed it, I would probably just take the "raw" regexp from Jeff's code and paste it into

http://www.cis.upenn.edu/~matuszek/General/RegexTester/regex-tester.html

to check that everything needing escaping gets escaped, and then use it.
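For example (a hypothetical pattern for illustration, not Jeff's actual one), a C# verbatim string carries its single backslashes straight through, while a Java string literal needs each one doubled:

import java.util.regex.Pattern;

class RegexPort {
    // C#:   Regex tag = new Regex(@"<(\w+)\s");   // verbatim string, single backslashes
    // Java: there are no verbatim strings, so every \ in the pattern is written as \\
    static final Pattern TAG = Pattern.compile("<(\\w+)\\s");
}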


Given how this regex will be used, I would personally make sure I understood exactly what I was doing, why, and what the consequences would be if I didn't succeed, before copy/pasting anything, as the other answers try to help you with.

(That's probably pretty sound advice for any copy/paste.)

svrist
+4  A: 

I'm not too convinced that using a regular expression is the best way of finding all suspect code. Regular expressions are quite easy to trick, especially when dealing with broken HTML. For example, the regular expression listed in the Sanitize HTML link will fail to remove all 'a' elements that have an attribute between the element name and the 'href' attribute:

<a alt="xss injection" href="http://www.malicous.com/bad.php">

A more robust way of removing malicious code is to rely on an XML parser that can handle all kinds of HTML documents (Tidy, TagSoup, etc.) and to select the elements to remove with an XPath expression. Once the HTML document is parsed into a DOM document, the elements to remove can be found easily and safely. This is even easy to do with XSLT.
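Here is a minimal sketch of that approach, assuming TagSoup (org.ccil.cowan.tagsoup) is on the classpath; the class name, helper names and the XPath expression are illustrative, not a definitive implementation:

import java.io.StringReader;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomSanitizer {

    // Parse even badly broken HTML into a DOM using TagSoup's SAX parser.
    public static Document parse(String html) throws Exception {
        org.ccil.cowan.tagsoup.Parser tagSoup = new org.ccil.cowan.tagsoup.Parser();
        // Disable namespaces so a plain XPath like //script matches.
        tagSoup.setFeature("http://xml.org/sax/features/namespaces", false);
        DOMResult dom = new DOMResult();
        TransformerFactory.newInstance().newTransformer().transform(
                new SAXSource(tagSoup, new InputSource(new StringReader(html))), dom);
        return (Document) dom.getNode();
    }

    // Remove every node matched by the XPath expression, e.g. "//script".
    public static void remove(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        for (int i = 0; i < hits.getLength(); i++) {
            Node node = hits.item(i);
            node.getParentNode().removeChild(node);
        }
    }
}

Something like remove(parse(html), "//script") would then drop every script element however mangled the markup, and the cleaned DOM can be re-serialised with a known-good serialiser.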

potyl
+1, see my response for a real-world Java API that does exactly that
Chase Seibert
+28  A: 

Don't do this with regular expressions. Remember, you're not protecting just against valid HTML; you're protecting against the DOM that web browsers create. Browsers can be tricked into producing valid DOM from invalid HTML quite easily.

For example, see this list of obfuscated XSS attacks. Are you prepared to tailor a regex to prevent this real-world attack on Yahoo and Hotmail on IE6/7/8?

<HTML><BODY>
<?xml:namespace prefix="t" ns="urn:schemas-microsoft-com:time">
<?import namespace="t" implementation="#default#time2">
<t:set attributeName="innerHTML" to="XSS&lt;SCRIPT DEFER&gt;alert(&quot;XSS&quot;)&lt;/SCRIPT&gt;">
</BODY></HTML>

How about this attack that works on IE6?

<TABLE BACKGROUND="javascript:alert('XSS')">

How about attacks that are not listed on this site? The problem with Jeff's approach is that it's not a whitelist, as claimed. As someone on that page adeptly notes:

The problem with it, is that the html must be clean. There are cases where you can pass in hacked html, and it won't match it, in which case it'll return the hacked html string as it won't match anything to replace. This isn't strictly whitelisting.

I would suggest a purpose-built tool like AntiSamy. It works by actually parsing the HTML, then traversing the DOM and removing anything that's not in the configurable whitelist. The major difference is the ability to gracefully handle malformed HTML.

The best part is that it actually has unit tests for all the XSS attacks on the above site. Besides, what could be easier than this API call:

import org.owasp.validator.html.*;

public String toSafeHtml(String html) throws ScanException, PolicyException {

    // POLICY_FILE is the path to an AntiSamy policy XML (the whitelist of allowed markup)
    Policy policy = Policy.getInstance(POLICY_FILE);
    AntiSamy antiSamy = new AntiSamy();
    CleanResults cleanResults = antiSamy.scan(html, policy);
    return cleanResults.getCleanHTML().trim();
}
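A hypothetical call, assuming POLICY_FILE points at one of the stock AntiSamy policy files:

String safe = toSafeHtml("<a alt=\"xss\" href=\"javascript:alert('XSS')\">click</a>");
// the javascript: URL (or the whole element) is dropped, depending on the policy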
Chase Seibert
AntiSamy looks great! Also, using different policies is a nice idea, as it keeps the cleaning rules outside of the code, making them easier to maintain. This is clearly a very nice approach. Kudos.
potyl
+1. You cannot reliably process HTML using regex. Parsing it into an easily-filterable DOM, then using a known-good serialisation, is by far the more sensible approach.
bobince
I really like this answer, as it does not directly answer the question, but addresses the underlying issue instead!
Thierry-Dimitri Roy
A: 

Match the whole input against [\s\w\.]*. If it doesn't match, you've got XSS. Maybe. Take note that this expression only allows letters, numbers, and periods. It avoids all symbols, even useful ones, out of fear of XSS. Once you allow &, you've got worries, and merely replacing all instances of & with &amp; is not sufficient. Too complicated to trust :P. Obviously this will disallow a lot of legitimate text (you could just replace all non-matching characters with a ! or something), but I think it will kill XSS.
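A minimal Java sketch of that check (String.matches is implicitly anchored, so no explicit ^...$ is needed):

static boolean looksSafe(String input) {
    // \w covers letters, digits and underscore; '.' needs no escaping inside a character class
    return input.matches("[\\s\\w.]*");
}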

The idea of just parsing it as HTML and generating new HTML is probably better.

Brian
A: 
^(\s|\w|\d|<br>)*?$

This will allow whitespace, word characters, digits, and also the <br> tag. If you want to take more risk you can add more tags, like

^(\s|\w|\d|<br>|<ul>|</ul>)*?$
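As a Java string literal, the same check might look like this sketch (backslashes doubled for the literal):

import java.util.regex.Pattern;

class TagWhitelist {
    // whitespace, word characters, digits, plus a small whitelist of literal tags
    private static final Pattern ALLOWED =
            Pattern.compile("^(\\s|\\w|\\d|<br>|<ul>|</ul>)*?$");

    static boolean isAllowed(String input) {
        return ALLOWED.matcher(input).matches();
    }
}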