ansaurus

Question

What are the pros and cons of the leading Java HTML parsers?

Answer 1

+2 A:

Add The validator.nu HTML Parser, an implementation of the HTML5 parsing algorithm in Java, to your list.

On the plus side, it's specifically designed to match HTML5 and sits at the heart of the HTML5 validator, so it is highly likely to match future browsers' parsing behaviour to a very high degree of accuracy.

On the minus side, no browser's legacy parser works exactly like this, and since HTML5 is still in draft, the algorithm is subject to change.

In practice, such problems only affect obscure corner cases, and it is, for all practical purposes, an excellent parser.

Alohci 2010-06-30 18:39:53

Answer 2

+4 A:

This article compares certain aspects of the following parsers:

NekoHTML
JTidy
TagSoup
HtmlCleaner

It is by no means a complete summary, and it is from 2008. But you may find it helpful.

Matt Solnit 2010-06-30 20:43:02

Answer 3

+1 A:

I found Jericho HTML Parser to be very well written, kept up to date (which many of the parsers are not), free of dependencies, and easy to use.

MJB 2010-06-30 23:09:32

Answer 4

+11 A:

General

Almost all known HTML parsers implement the W3C DOM API (part of JAXP, the Java API for XML Processing) and give you an org.w3c.dom.Document back. The major differences are usually found in the features of the parser in question. Most parsers are to a certain degree forgiving and lenient with non-well-formed HTML ("tag soup"), like JTidy, HtmlCleaner and TagSoup. You usually use this kind of HTML parser to "tidy" the HTML source so that you can traverse it "the usual way" using the W3C DOM and JAXP API.

The only ones which stand out are HtmlUnit and Jsoup.
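
As a rough illustration of traversing "the usual way": the sketch below assumes the markup has already been tidied into well-formed XHTML, and uses the JDK's built-in DocumentBuilder as a stand-in for a lenient parser such as JTidy. The class name, the sample markup and the example.com URLs are made up for illustration; only the org.w3c.dom calls are the point.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomTraversalSketch {

    // Hypothetical, already well-formed markup standing in for "tidied" HTML.
    static final String XHTML =
        "<html><body>"
        + "<p>See <a href='http://example.com/jsoup'>Jsoup</a> and "
        + "<a href='http://example.com/htmlunit'>HtmlUnit</a>.</p>"
        + "</body></html>";

    // Parses well-formed markup with the JDK's XML parser (not an HTML parser).
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Markup is not well-formed", e);
        }
    }

    // Collects "text -> href" for every <a> element using only the W3C DOM API.
    static List<String> links(Document document) {
        List<String> result = new ArrayList<>();
        NodeList anchors = document.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            result.add(anchor.getTextContent() + " -> " + anchor.getAttribute("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String link : links(parse(XHTML))) {
            System.out.println(link);
        }
    }
}
```

Whichever parser produced the Document, the links method would stay the same; only the parse step changes.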

HtmlUnit

HtmlUnit provides its own API, which lets you act like a web browser programmatically, i.e. enter form values, click elements, invoke JavaScript, et cetera. It's much more than just an HTML parser. It's a real "GUI-less web browser" and HTML unit testing tool.

Jsoup

Jsoup also provides its own API. It lets you select elements using jQuery-like CSS selectors and provides a very nice API to traverse the HTML DOM tree. It's in my opinion a real revolution. Those who have worked with org.w3c.dom.Document know what a pain it is to traverse the DOM to get the elements of interest using the verbose NodeList and Node APIs. True, XPath makes life easier, but still, it's another learning curve and it can end up being pretty verbose.

Here's an example which uses a "plain" W3C DOM parser like JTidy in combination with XPath to extract the first paragraph of your question and the names of all answerers (I am using XPath since without it, the code needed to gather the information of interest would grow to 10 times the size).

import java.net.URL;
import javax.xml.xpath.*;
import org.w3c.dom.*;
import org.w3c.tidy.Tidy;

String url = "http://stackoverflow.com/questions/3152138";
// Let JTidy clean up the markup and hand back a W3C DOM Document.
Document document = new Tidy().parseDOM(new URL(url).openStream(), null);
XPath xpath = XPathFactory.newInstance().newXPath();

// First paragraph of the question body.
Node question = (Node) xpath.compile("//*[@id='question']//*[contains(@class,'post-text')]//p[1]").evaluate(document, XPathConstants.NODE);
System.out.println("Question: " + question.getFirstChild().getNodeValue());

// First link inside each answer's user details block.
NodeList answerers = (NodeList) xpath.compile("//*[@id='answers']//*[contains(@class,'user-details')]//a[1]").evaluate(document, XPathConstants.NODESET);
for (int i = 0; i < answerers.getLength(); i++) {
    System.out.println("Answerer: " + answerers.item(i).getFirstChild().getNodeValue());
}

And here's an example of how to do exactly the same with Jsoup:

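import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String url = "http://stackoverflow.com/questions/3152138";
// Fetch and parse the page in one go.
Document document = Jsoup.connect(url).get();

// First paragraph of the question body.
String question = document.select("#question .post-text p").first().text();
System.out.println("Question: " + question);

// Every link inside each answer's user details block.
Elements answerers = document.select("#answers .user-details a");
for (Element answerer : answerers) {
    System.out.println("Answerer: " + answerer.text());
}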

Do you see the difference? For me, as a web developer with a decade of experience, Jsoup was easy to grasp thanks to its support for CSS selectors, which I am already familiar with.
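
To make the NodeList-versus-XPath point concrete without any network access, here's a small sketch that runs the answerer lookup twice against a made-up, well-formed snippet: once with XPath and once with plain W3C DOM calls. The sample markup, names and class name are hypothetical and far simpler than the real page, and the JDK's XML parser stands in for JTidy. The hand-written version simply takes the first a descendant of each matching block, which matches the XPath result for this sample but is not a general equivalent of //a[1].

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AnswererLookupSketch {

    // Hypothetical sample markup; the real page structure may differ.
    static final String SAMPLE =
        "<div id='answers'>"
        + "<div class='user-details'><a href='/u/1'>Alice</a></div>"
        + "<div class='comment'><a href='/u/2'>Bob</a></div>"
        + "<div class='user-details'><a href='/u/3'>Carol</a></div>"
        + "</div>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Markup is not well-formed", e);
        }
    }

    // XPath version: one expression does the filtering.
    static List<String> withXPath(Document document) {
        try {
            NodeList links = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                "//*[@id='answers']//*[contains(@class,'user-details')]//a[1]",
                document, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++) {
                names.add(links.item(i).getTextContent());
            }
            return names;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Plain DOM version: every filter step is written out by hand.
    static List<String> withNodeList(Document document) {
        List<String> names = new ArrayList<>();
        NodeList all = document.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element candidate = (Element) all.item(i);
            if (!"answers".equals(candidate.getAttribute("id"))) {
                continue; // not the answers container
            }
            NodeList descendants = candidate.getElementsByTagName("*");
            for (int j = 0; j < descendants.getLength(); j++) {
                Element block = (Element) descendants.item(j);
                if (!block.getAttribute("class").contains("user-details")) {
                    continue; // not a user details block
                }
                NodeList anchors = block.getElementsByTagName("a");
                if (anchors.getLength() > 0) {
                    names.add(anchors.item(0).getTextContent());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Document document = parse(SAMPLE);
        System.out.println(withXPath(document));    // prints [Alice, Carol]
        System.out.println(withNodeList(document)); // prints [Alice, Carol]
    }
}
```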

Summary

The pros and cons of each should be obvious enough. If you just want to use an XML-based tool to traverse it, then go for the first-mentioned group of parsers. Which one to choose depends on the features it provides and the robustness of the library (how often is it updated/maintained/fixed?). There are quite a lot of them. My personal preference among them is JTidy (HtmlCleaner is also nice; it was the best choice until JTidy finally updated their API last year after years of absence). If you'd like to unit test the HTML, then HtmlUnit is the way to go. If you'd like to extract specific data from the HTML, then Jsoup is the way to go.

BalusC 2010-07-01 00:00:32

Wow, great answer. Thanks!

Avi Flax 2010-07-02 03:03:11

You're welcome.

BalusC 2010-07-02 03:28:54
