I'm looking for a C/C++ functional equivalent of Perl's HTML::Defang, and my Google-fu has not turned up anything. I want to keep benign tags and strip out or defang everything else. Lacking an actual library, any pointers to complete lists of tags, attributes, etc. to defang would be appreciated. I know of http://en.wikipedia.org/wiki/DOM%5FEvents. Thanks.

A: 

In Java, I use JTidy to clean up HTML. I'm not sure it would suit your needs, but JTidy is a port of HTML Tidy, which is written in C; if you Google for JTidy you can follow the link to that implementation and see whether it does what you want.

As for what to defang: Look at the W3C specs for HTML; any tag not in there doesn't belong in HTML. But again, I could be misunderstanding your "defang" concept.
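If you end up rolling your own, the "keep benign, drop the rest" idea is usually easier to get right as an allowlist than as a list of things to defang. Here is a minimal sketch in C++, assuming you already have a parser that hands you tag and attribute names. The function names (`is_allowed_tag`, `is_allowed_attribute`) and the short lists are hypothetical placeholders, not a complete or vetted set.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Lower-case an ASCII string so comparisons are case-insensitive.
static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Hypothetical, deliberately short allowlist. This is an illustration only,
// NOT a complete list of safe tags.
bool is_allowed_tag(const std::string& name) {
    static const std::set<std::string> allowed = {
        "a", "b", "i", "em", "strong", "p", "br",
        "ul", "ol", "li", "blockquote", "pre"};
    return allowed.count(to_lower(name)) > 0;
}

// Allow only attributes that are explicitly listed. Event handlers such as
// onclick or onload are never listed, so they are rejected by default.
bool is_allowed_attribute(const std::string& tag,
                          const std::string& name,
                          const std::string& value) {
    const std::string t = to_lower(tag);
    const std::string n = to_lower(name);

    if (n == "title") {
        return true;  // plain text, accepted on any allowed tag
    }
    if (n == "href" && t == "a") {
        // Accept only a few URL schemes, so javascript: and data: URLs are
        // rejected. Relative URLs are rejected too; this is conservative.
        const std::string v = to_lower(value);
        return v.rfind("http://", 0) == 0 ||
               v.rfind("https://", 0) == 0 ||
               v.rfind("mailto:", 0) == 0;
    }
    return false;
}
```

With this sketch, `is_allowed_tag("script")` is false and `is_allowed_attribute("a", "href", "javascript:alert(1)")` is false, while `is_allowed_attribute("a", "HREF", "https://example.com/")` is true. Anything not on the lists is dropped, so a tag or attribute you have never heard of is rejected without you having to enumerate it.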

Carl Smotricz
Basically what I want is what web-based email systems do when presented with HTML email. Display what they can, nuke the rest, including any attacks.
MarkRWC
This is more an art than a science. I think you'd do well to let Tidy strip out any scripts. But I can't evaluate Tidy for you. Try it!
Carl Smotricz
A: 

libxml2 is free and should do what you want.

http://www.xmlsoft.org/

See this part of the API: http://www.xmlsoft.org/html/libxml-HTMLparser.html

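The htmlReadFile() function might do the trick: it parses an HTML file into a document tree, which you can then walk to remove any elements and attributes you don't want.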

To get you started with libxml2 some examples can be found here:

http://www.xmlsoft.org/examples/index.html

jcoffland