views:

86

answers:

3

Hello, I use Owasp Anti samy with Ebay policy file to prevent XSS attacks on my website.

I also use Hibernate search to index my objects.

When I use this code:

String html = "special word: été";    

// use the Ebay configuration file    
Policy policy = Policy.getInstance(xssPolicyFile.getInputStream());

AntiSamy as = new AntiSamy();
CleanResults cr = as.scan(html, policy);

// result is now : "special word: été"
result = cr.getCleanHTML();

As you can see all chars "é" has been transformed to their html entity equivalent "é"

My page is on UTF-8, so I don't need this transformation. Moreover, when I index this text with Hibernate Search, it indexes the word with html entities, so I can't find word "été" on my index.

How can I force antisamy to not transform special chars to their html entity equivalent ?

thanks

A: 

After scouring the AntiSamy source code, I found no way of changing this behavior apart from modifying AntiSamy.

A: 

Check out this one: http://code.google.com/p/owaspantisamy/source/browse/#svn/trunk/dotNet/current/source/owaspantisamy/html/scan

Grab the source and notice that key classes (AntiSamyDOMScanner, CleanResults) use standard framework objects (like XmlDocument). Compile and run with the binary you compiled - so that you can see everything in a debugger - as in which of the major classes actually corrupts your data. With that in hand you'll be able to either change a few properties on major objects to make it stop or inject your own post-processing to revert the wrongdoing (say with a regexp). Latter you can expose that as additional top-level property, say one named NoMess :-)

Chances are that behavior in that respect is different between languages (there's 3 in that trunk) but the same tactics will work no matter which one you have to deal with.

ZXX
I've thinking of that but that's very tricky because I use Maven and so, I never download entire project source to modify it.That's strange that Antisamy does not implement such a feature.I think I will do it but it's really ugly.
Jerome C.
Well it's a good excuse to start decoupling :-) Might also be the matter of "assumed" output encoding.
ZXX
+1  A: 

I ran into the same problem this morning.

I have encapsulated antisamy in a class and I use apache StringEscapeUtil from apache common-lang to restore special characters.

 CleanResults cleanResults = antiSamy.scan(taintedHtml);
 cleanedHtml = cleanResults.getCleanHTML();  
 return StringEscapeUtils.unescapeHtml(cleanedHtml)

The result is a cleaned up HTML without the HTML escaping of special characters.

Hope this helps.

Vincent Giguère