Don't do this with regular expressions. Remember, you're not protecting just against valid HTML; you're protecting against the DOM that web browsers create. Browsers can be tricked into producing valid DOM from invalid HTML quite easily.
For example, see this list of obfuscated XSS attacks. Are you prepared to tailor a regex to prevent this real-world attack on Yahoo and Hotmail on IE6/7/8?
<HTML><BODY>
<?xml:namespace prefix="t" ns="urn:schemas-microsoft-com:time">
<?import namespace="t" implementation="#default#time2">
<t:set attributeName="innerHTML" to="XSS<SCRIPT DEFER>alert("XSS")</SCRIPT>">
</BODY></HTML>
How about this attack that works on IE6?
<TABLE BACKGROUND="javascript:alert('XSS')">
How about attacks that are not listed on this site? The problem with Jeff's approach is that it's not a whitelist, as claimed. As someone on that page adeptly notes:
"The problem with it, is that the html must be clean. There are cases where you can pass in hacked html, and it won't match it, in which case it'll return the hacked html string as it won't match anything to replace. This isn't strictly whitelisting."
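That failure mode is easy to reproduce. Below is a minimal, hypothetical sketch (not Jeff's actual code; the class name, pattern, and allowed-tag set are invented for illustration) of a regex "whitelist" that only rewrites markup its pattern recognizes: anything the pattern fails to match, including deliberately malformed attack strings, comes back unchanged.

import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveWhitelistSanitizer {

    // Things that look like tags: an optional /, a tag name, attributes, a closing >
    private static final Pattern TAG = Pattern.compile("</?(\\w+)[^<>]*>");
    private static final Set<String> ALLOWED = Set.of("b", "i", "p", "a");

    public static String clean(String html) {
        Matcher m = TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // Keep whitelisted tags, drop everything else the pattern recognizes
            String replacement = ALLOWED.contains(m.group(1).toLowerCase())
                    ? Matcher.quoteReplacement(m.group(0))
                    : "";
            m.appendReplacement(out, replacement);
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Well-formed tags are filtered as intended...
        System.out.println(clean("<b>hi</b><script>alert('XSS')</script>"));
        // ...but this half-open IE vector never matches the TAG pattern at all,
        // so the "sanitizer" returns the hacked string completely untouched.
        System.out.println(clean("<IMG SRC=\"javascript:alert('XSS')\""));
    }
}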
I would suggest a purpose-built tool like AntiSamy. It works by actually parsing the HTML, then traversing the DOM and removing anything that's not in the configurable whitelist. The major difference is the ability to gracefully handle malformed HTML.
The best part is that it actually has unit tests for all the XSS attacks on the above site. Besides, what could be easier than this API call:
// AntiSamy's classes live in org.owasp.validator.html
public String toSafeHtml(String html) throws ScanException, PolicyException {
    // Load the whitelist policy; POLICY_FILE points at an AntiSamy policy XML
    Policy policy = Policy.getInstance(POLICY_FILE);
    AntiSamy antiSamy = new AntiSamy();
    // Parse the input and strip anything the policy does not allow
    CleanResults cleanResults = antiSamy.scan(html, policy);
    return cleanResults.getCleanHTML().trim();
}
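A hypothetical caller might look like this; the HtmlSanitizer class name is assumed, and POLICY_FILE is expected to point at an AntiSamy policy XML (the project ships example policies such as antisamy-slashdot.xml that you can start from):

// Hypothetical usage of the method above, assuming it lives in an HtmlSanitizer class
String dirty = "<TABLE BACKGROUND=\"javascript:alert('XSS')\"><p>hello</p>";
String safe = new HtmlSanitizer().toSafeHtml(dirty);
// safe now contains only the markup allowed by the loaded policy

Whether the <p> survives depends entirely on the policy you load, which is the point: the whitelist lives in configuration, not in a regex.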