views:

1334

answers:

7

I write a lot of parsers. Until now, I have been using the HtmlUnit headless browser for both parsing and browser automation.

Now I want to separate the two tasks.

As 80% of my work involves just parsing, I want to use a lightweight HTML parser, because with HtmlUnit it takes a long time to first load a page, then get the source, and then parse it.

I want to know which HTML parser is best. A parser that is close to the HtmlUnit parser would be preferable.

==============================Edited=========================================

By "best", I mean it should have at least the following features:

  1. Speed
  2. Ease of locating any HtmlElement by its "id", "name", or "tag type".

It is fine if the parser does not clean up dirty HTML; I don't need to clean any HTML source. I just need the easiest way to move across HtmlElements and harvest data from them.

+5  A: 

The best I've seen so far is HtmlCleaner:

HtmlCleaner is open-source HTML parser written in Java. HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring the order to tags, attributes and ordinary text. For the given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create Document Object Model. However, user may provide custom tag and rule set for tag filtering and balancing.

With HtmlCleaner you can locate any element using XPath.
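As a rough illustration of the XPath lookup mentioned above, here is a minimal sketch. It assumes an HtmlCleaner 2.x jar on the classpath; the class name `HtmlCleanerXPathSketch`, the helper methods, and the sample markup are made up for this example and are not part of the library.

```java
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;

public class HtmlCleanerXPathSketch {
    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
        "<html><body><div id=\"main\"><p>First</p><p>Second</p></div></body></html>";

    // Counts the <p> elements directly under the div with id="main".
    static int countParagraphs(String html) throws XPatherException {
        TagNode root = new HtmlCleaner().clean(html);
        Object[] hits = root.evaluateXPath("//div[@id='main']/p");
        return hits.length;
    }

    // Returns the text of the first matching <p>, or null if there is none.
    static String firstParagraphText(String html) throws XPatherException {
        TagNode root = new HtmlCleaner().clean(html);
        Object[] hits = root.evaluateXPath("//div[@id='main']/p");
        if (hits.length == 0) {
            return null;
        }
        return ((TagNode) hits[0]).getText().toString().trim();
    }

    public static void main(String[] args) throws XPatherException {
        System.out.println(countParagraphs(HTML));
        System.out.println(firstParagraphText(HTML));
    }
}
```

Note that HtmlCleaner implements only a subset of XPath, so more elaborate expressions may need to be checked against its documentation.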

For other html parsers see this SO question.

tangens
A: 

I suggest Validator.nu's parser, based on the HTML5 parsing algorithm. (Mozilla is currently in the process of replacing its own HTML parser with this one.)

Ms2ger
+2  A: 

You can try the Jericho parser.

Massimo Fazzolari
A: 

I had a good experience with Cobra.

craftsman
A: 

I would recommend http://mozillaparser.sourceforge.net/

Andriy Sholokh
+1  A: 

NekoHtml

Jay Askren
+7  A: 

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.

Its party trick is a CSS selector syntax to find elements, e.g.:

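String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
// Tag selector: every <a> element (none in this sample, so the result is empty)
Elements links = doc.select("a");
// "#head" would match id="head"; a bare tag name selects the <head> element
Element head = doc.select("head").first();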

See the Selector javadoc for more info.
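For the three lookups the question asks about (by id, by name, and by tag type), a minimal sketch could look like the following. It assumes a current jsoup release on the classpath; the class name `JsoupLookupSketch`, the helper methods, and the sample markup are hypothetical and exist only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupLookupSketch {
    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
        "<html><head><title>Sample</title></head><body>"
      + "<form><input id=\"q\" name=\"query\" value=\"jsoup\"></form>"
      + "<p>First</p><p>Second</p>"
      + "</body></html>";

    // Lookup by id: returns the element's "value" attribute, or null if no such id.
    static String valueById(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element e = doc.getElementById(id);
        return e == null ? null : e.attr("value");
    }

    // Lookup by name: counts elements whose name attribute equals the given value.
    static int countByName(String html, String name) {
        Elements hits = Jsoup.parse(html).select("[name=" + name + "]");
        return hits.size();
    }

    // Lookup by tag type: returns the text of the first element with that tag.
    static String firstTextByTag(String html, String tag) {
        Elements hits = Jsoup.parse(html).getElementsByTag(tag);
        return hits.isEmpty() ? null : hits.first().text();
    }

    public static void main(String[] args) {
        System.out.println(valueById(HTML, "q"));
        System.out.println(countByName(HTML, "query"));
        System.out.println(firstTextByTag(HTML, "p"));
    }
}
```

Each helper re-parses the string to stay self-contained; in real code you would parse once and reuse the `Document`.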

This is a new project, so any ideas for improvement are very welcome!

Jonathan Hedley
Jsoup is pretty slick, man. Nice work.
JMTyler
This thing is fantastic, and I love the CSS selector support. I barely know I'm using a Java library. :-)
William Pietri