OK, there are many HTML/XML parsers for Java. What I want to do goes a bit beyond just parsing: I want to filter the content and get it into a suitable form.

More precisely, I want to keep only the text and images. However, I want to preserve some of the text formatting, too, like: italic, bold, alignment, etc.

The reason is that I'm trying to implement a converter from HTML to a specific format that I've created for my own purposes.

Any ideas? Surely, it must have been done many times before.

Thanks, guys!

A: 

O.K., I think I've figured it out: when parsing the Element I can construct a javax.swing.text.html.InlineView, i.e. InlineView iv = new InlineView(element), and then get the attributes with iv.getAttributes().

Still, if you can help further, e.g. if you have some first-hand experience to share, please do!
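
For readers who want to stay within the JDK's Swing HTML package, here is a minimal sketch of a different route from the InlineView idea above: a parser callback that keeps text, bold/italic markers, and image sources. The class name SwingHtmlFilter and the target markup (*bold*, _italic_, [img:src]) are made up for illustration, not the asker's actual format, and the Swing parser's whitespace handling may differ from what a browser would do.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingHtmlFilter {

    /** Converts HTML to a hypothetical markup: *bold*, _italic_, [img:src]. */
    static String convert(String html) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data); // keep all text content
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                out.append(marker(tag));
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                out.append(marker(tag));
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Empty elements such as <img> arrive here rather than in handleStartTag.
                if (tag == HTML.Tag.IMG) {
                    out.append("[img:").append(attrs.getAttribute(HTML.Attribute.SRC)).append("]");
                }
            }
        };
        try {
            // 'true' tells the parser to ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    /** Maps a formatting tag to its marker; every other tag is dropped. */
    static String marker(HTML.Tag tag) {
        if (tag == HTML.Tag.B || tag == HTML.Tag.STRONG) return "*";
        if (tag == HTML.Tag.I || tag == HTML.Tag.EM) return "_";
        return "";
    }
}
```

Alignment could be picked up the same way by reading HTML.Attribute.ALIGN from the attribute set passed to handleStartTag.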

Albus Dumbledore
+1  A: 

Have a look at HTML Parser, it could be handy.

George Profenza
Thanks, I will!
Albus Dumbledore
A: 

You can use the XML DOM parser from the org.w3c.dom and javax.xml.parsers packages. With it you can easily parse the document and get the node contents. Note that it only accepts well-formed XML, so the HTML has to be valid XHTML.

 Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);

Then get the elements with

NodeList nl = doc.getElementsByTagName("p"); // for paragraph tags

and read the content from the NodeList, e.g. via nl.item(i).getTextContent(). That gives you the whole text content of each paragraph tag; you can do the same for any other tag.
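
Put together, the steps above might look like the following self-contained sketch. The class and method names (ParagraphExtractor, paragraphs) are made up for illustration, and the input is assumed to be well-formed XHTML without a namespace declaration; ordinary "tag soup" HTML will make the parser throw.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ParagraphExtractor {

    /** Returns the text content of every <p> element in a well-formed XHTML string. */
    static List<String> paragraphs(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));

            NodeList nl = doc.getElementsByTagName("p");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nl.getLength(); i++) {
                // getTextContent() concatenates the text of all descendants,
                // so inline formatting tags such as <b> are flattened away.
                result.add(nl.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Input could not be parsed as XML", e);
        }
    }
}
```

Because getTextContent() discards nested tags, preserving bold or italic would require walking the child nodes of each paragraph instead.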

karthi
+2  A: 

JTidy + XSLT?
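
A rough sketch of the XSLT half, using only the JDK's built-in transformer. It assumes the HTML has already been cleaned into well-formed, un-namespaced XHTML (for example by JTidy, which is not shown here). The class name XsltConverter and the target markup (*bold*, _italic_, [img:src]) are placeholders, not a real format.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltConverter {

    // One template per element of interest. Elements without a template fall back
    // to XSLT's built-in rules, which drop the tag and keep the text inside it.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='b|strong'>*<xsl:apply-templates/>*</xsl:template>"
        + "<xsl:template match='i|em'>_<xsl:apply-templates/>_</xsl:template>"
        + "<xsl:template match='img'>[img:<xsl:value-of select='@src'/>]</xsl:template>"
        + "<xsl:template match='script|style'/>" // drop these elements entirely
        + "</xsl:stylesheet>";

    /** Transforms well-formed XHTML into the hypothetical plain-text markup. */
    static String convert(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }
}
```

Each additional element or attribute to preserve needs its own template, and XHTML that declares the XHTML namespace would need namespace-qualified match patterns.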

tulskiy
+4  A: 

If your intent is to clean user-submitted content against a safe whitelist to prevent XSS, then I'd suggest using Jsoup for this. It provides built-in whitelists. It's then as simple as:

String safeHtml = Jsoup.clean(unsafeHtml, Whitelist.basicWithImages());

You can customize the Whitelist as described in its javadoc. Newer jsoup releases name this class Safelist.

See also:

BalusC
Damn, this Jsoup is really well thought out. +1
Pascal Thivent
Thanks. The link turned out to be ***very*** useful! As I *said* I am trying to convert HTML to my custom format. Jsoup is quite promising, but HtmlUnit *is* quite close to the point! Thanks a lot!
Albus Dumbledore
You're welcome :) After cleaning, you could use Jsoup as well to iterate over all HTML elements and convert/transform each into another markup. You can also do this with XSLT, although it may end up pretty complex, since you have to specify every HTML element and/or attribute separately.
BalusC