Can you recommend an open source Java library (preferably ASL/BSD/LGPL license) that converts HTML to plain text: strips all the tags, converts entities (&amp;, &nbsp;, etc.) and handles <br> and tables properly?

More Info

I have the HTML as a string, so there's no need to fetch it from the web. Also, what I'm looking for is a method like this:

String convertHtmlToPlainText(String html)
+1  A: 

Try HtmlUnit; it even shows the page after processing JavaScript/Ajax.

Ahmed Ashour
I only see how it gives me the response as HTML, not as text
David Rabinowitz
Check .asText() [http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/html/DomNode.html#asText()]
Ahmed Ashour
Thanks. I went for Jericho in the end, but I'll keep HtmlUnit in mind
David Rabinowitz
A: 

I use TagSoup. It is available for several languages and does a really good job with HTML found "in the wild". It produces either a cleaned-up version of the HTML or XML, which you can then process with a DOM/SAX parser.

Rich Seller
Thanks, but I need the final result in plain text
David Rabinowitz
Once it is in XML, you can implement a SAX handler that outputs only the text nodes (e.g. a DefaultHandler, which has no-op implementations of all methods, with only `characters` overridden)
Rich Seller
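
A minimal sketch of the handler described in the comment above, using only the JDK's built-in SAX parser. The class name `TextOnlyHandler` and the helper `extractText` are made up for illustration, and the sketch assumes the input is already well-formed XML (for example, the cleaned-up output of TagSoup). It does nothing special for <br> or tables; it simply concatenates text nodes.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: DefaultHandler already ignores every event,
// so overriding characters() alone keeps just the text nodes.
class TextOnlyHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per text node; append each chunk.
        text.append(ch, start, length);
    }

    String getText() {
        return text.toString();
    }

    // Assumes `xml` is well-formed XML, not raw HTML from the wild.
    static String extractText(String xml) {
        TextOnlyHandler handler = new TextOnlyHandler();
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
        return handler.getText();
    }
}
```

Calling `TextOnlyHandler.extractText(xml)` on the cleaned-up markup returns the concatenated text. Inserting line breaks for block elements would mean overriding `startElement` or `endElement` as well.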
+2  A: 

Try Jericho.

The TextExtractor class sounds like it will do what you want. Sorry, I can't post a second link as I'm a new user, but if you scroll down the homepage a bit there's a link to it.

Chris R
Here's the link to that class: http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html
Chris R
Thanks! I actually used the Renderer in the end
David Rabinowitz