We have a Java widget that does some basic parsing on arbitrary XHTML documents, and we've been using JTidy to clean them up before processing.

For a couple of reasons (which are outside the scope of this particular question), we're looking to replace JTidy with a different library.

Can anyone recommend something? We're looking for something that will take a URI, clean up the XML, and produce an object that implements org.w3c.dom.Document (or something that can be turned into a Document without too much effort or damage).

And, as you might imagine, something that's free is also a bonus.

+4  A: 

NekoHTML is the way to go. I've used it in many contexts for years to parse HTML into XML and it is always up to the task, often in places where JTidy fails to deliver.

Alex Vigdor
+4  A: 

TagSoup is wonderful, and has fewer problematic dependencies than NekoHTML. Apache Tika switched from NekoHTML to TagSoup.
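
Since the question asks for an org.w3c.dom.Document, here is a rough sketch of how the events from a SAX parser can be collected into a DOM tree with the JDK's identity transformer. The class name SaxToDom and the method toDocument are made up for illustration. To keep it runnable without extra jars, the demo uses the JDK's built-in XMLReader on input that is already well-formed; with TagSoup on the classpath, the assumption is that you would pass an instance of org.ccil.cowan.tagsoup.Parser (which implements XMLReader) instead.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxToDom {

    // Runs the given SAX reader over the input and collects the
    // resulting events into a DOM tree via an identity transform.
    public static Document toDocument(XMLReader reader, InputSource input) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult(); // the transformer creates the Document node
        identity.transform(new SAXSource(reader, input), result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader: the JDK's own parser, which only accepts well-formed XML.
        // Assumption: a tag-soup parser that implements XMLReader would be swapped in here.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        // For a remote page, an InputSource can also be built from a URI string
        // (new InputSource(systemId)) rather than from a Reader.
        InputSource input = new InputSource(
                new StringReader("<html><body><p>hi</p></body></html>"));

        Document doc = toDocument(reader, input);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

The point of going through SAXSource is that the parser stays swappable: anything that exposes an XMLReader can feed the same toDocument helper.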

bmargulies
+4  A: 

TagSoup, Jericho and NekoHTML are all good at parsing any kind of crap (I especially like the first two).

Another alternative is HTMLCleaner, which looks promising. Quoting its announcement on TSS (actually, I suggest reading the whole thread):

HTMLCleaner is Java library used to safely parse and transform any HTML found on web to well-formed XML. It is designed to be small, fast, flexible and independant. HtmlCleaner may be used in java code, as command line tool or as Ant task. Result of parsing is lightweight document object model which can easily be transformed to standards like DOM or JDom, or serialized to XML output in various ways (compact, pretty printed and so on).

I'd give HTMLCleaner a try.

Pascal Thivent
HTMLCleaner was exactly what I needed. Thanks!
Electrons_Ahoy