Text extraction with Java HTML parsers

I want to use an HTML parser that does the following in a nice, elegant way:

Extract text (this is the most important)
Extract links and meta keywords
Reconstruct the original document (optional, but a nice feature to have)

From my investigation so far, Jericho seems to fit. Are there any other open source libraries you would recommend?

I ended up using HtmlCleaner http://htmlcleaner.sourceforge.net/ for something similar. It's really easy to use and was quick for what I needed.

William 2010-04-09 18:48:26

After a quick Google search, I found this.

Hope that helps :)

npinti 2010-04-09 18:51:51

I recently experimented with HtmlCleaner and CyberNekoHtml. CyberNekoHtml is a DOM/SAX parser that produces predictable results. HtmlCleaner is a tad faster, but quite often fails to produce accurate results.

I would recommend CyberNekoHtml. CyberNekoHtml can do all of the things you mentioned. It is very easy to extract a list of all elements, and their attributes, for example. It would be possible to traverse the DOM tree building each element back into HTML if you wanted to reconstruct the page.
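As a rough sketch of the traversal described above: NekoHTML's DOM parser hands back a standard org.w3c.dom.Document, so the extraction code only needs the W3C DOM API. To keep the example runnable without extra jars, it assumes the JDK's own XML parser as a stand-in, and that parser only accepts well-formed markup. The class and method names (DomTextExtractor, extractText, extractLinks, metaKeywords) are made up for illustration. Tag names are compared case-insensitively because, as far as I know, NekoHTML upper-cases element names by default.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTextExtractor {

    // Stand-in for an HTML parser: the JDK XML parser, which needs well-formed input.
    // With NekoHTML you would obtain the Document from its DOM parser instead.
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    // Concatenates all text nodes in document order and collapses whitespace.
    // Adjacent elements are only separated if the source has whitespace between them.
    static String extractText(Document doc) {
        StringBuilder sb = new StringBuilder();
        collectText(doc.getDocumentElement(), sb);
        return sb.toString().trim().replaceAll("\\s+", " ");
    }

    private static void collectText(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue());
            return;
        }
        String name = node.getNodeName();
        // Skip content that is not visible page text.
        if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)) {
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, sb);
        }
    }

    // Returns the href value of every <a> element that has one.
    static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if ("a".equalsIgnoreCase(e.getTagName()) && e.hasAttribute("href")) {
                links.add(e.getAttribute("href"));
            }
        }
        return links;
    }

    // Returns the content of <meta name="keywords">, or "" if there is none.
    static String metaKeywords(Document doc) {
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if ("meta".equalsIgnoreCase(e.getTagName())
                    && "keywords".equalsIgnoreCase(e.getAttribute("name"))) {
                return e.getAttribute("content");
            }
        }
        return "";
    }

    public static void main(String[] args) {
        Document doc = parse("<html><head><title>Demo</title>"
                + "<meta name=\"keywords\" content=\"java, html\"/></head>\n"
                + "<body><p>Hello <b>world</b>.</p>\n"
                + "<a href=\"http://example.com/\">link</a></body></html>");
        System.out.println(extractText(doc));
        System.out.println(extractLinks(doc));
        System.out.println(metaKeywords(doc));
    }
}
```

Because everything after parse() works on the plain DOM interfaces, swapping in a real HTML parser should only mean changing how the Document is obtained. Reconstructing the page would be a similar recursive walk that writes each element's tag and attributes back out.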

There's a list of open source Java HTML parsers here: http://java-source.net/open-source/html-parsers

Finbarr 2010-04-09 19:17:37
