ansaurus

Question

Answer 1

+5 A:

The best I've seen so far is HtmlCleaner:

HtmlCleaner is an open-source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to the tags, attributes and ordinary text. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those most web browsers use to create the Document Object Model. However, the user may provide a custom tag and rule set for tag filtering and balancing.

With HtmlCleaner you can locate any element using XPath.
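
To give a rough idea of what an XPath lookup on cleaned-up markup looks like, here is a minimal sketch. Note that it does not use HtmlCleaner's own API: it assumes the HTML has already been turned into well-formed XML and queries it with the JDK's built-in javax.xml.xpath classes. The class name XPathSketch and the helper selectText are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of HtmlCleaner or any other library.
class XPathSketch {

    // Returns the text content of every node matched by the XPath expression.
    // Assumes the input is already well-formed XML (e.g. the output of a cleaner).
    static List<String> selectText(String wellFormedXml, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedXml)));

        NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String cleaned = "<html><body><p>First</p><p>Second</p></body></html>";
        System.out.println(selectText(cleaned, "//p")); // prints [First, Second]
    }
}
```

The standard DOM parser will reject real-world tag soup, which is exactly why a cleaning step has to come first.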

For other HTML parsers, see this SO question.

tangens 2010-01-30 16:53:33

Answer 2

A:

I suggest Validator.nu's parser, based on the HTML5 parsing algorithm. (Mozilla is currently in the process of replacing its own HTML parser with this one.)

Ms2ger 2010-01-30 17:32:36

Answer 3

+2 A:

You can try the Jericho HTML Parser.

Massimo Fazzolari 2010-01-30 18:34:36

Answer 4

A:

I had a good experience with Cobra.

craftsman 2010-01-30 19:14:40

Answer 5

A:

I would recommend http://mozillaparser.sourceforge.net/

Andriy Sholokh 2010-01-30 22:07:37

Answer 6

+1 A:

NekoHTML

Jay Askren 2010-01-30 22:37:08

Answer 7

+7 A:

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.

Its party trick is a CSS selector syntax to find elements, e.g.:

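import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");          // all <a> elements (none in this sample)
Element head = doc.select("head").first(); // tag selector; "#head" would match id="head"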

See the Selector javadoc for more info.

This is a new project, so any ideas for improvement are very welcome!

Jonathan Hedley 2010-01-31 07:41:16

Jsoup is pretty slick, man. Nice work.

JMTyler 2010-06-16 05:28:22

This thing is fantastic, and I love the CSS selector support. I barely know I'm using a Java library. :-)

William Pietri 2010-09-16 00:25:28

Which Html Parser is best?