Is there a good parser or library that can read an HTML document into a DOM tree using Java? I'd like to use the standard DOM/XPath API that Java provides.

But all the libraries I can find seem to have only custom APIs for this task. Furthermore, converting HTML to an XML DOM seems to be unsupported by most of the available parsers.

Any ideas or experience with a good HTML DOM parser?

Thanks in advance

+1  A: 

Apache's Xerces2 parser should do what you want.

Pesto
+3  A: 

JTidy: either process the stream into XHTML and then re-parse it with your favourite DOM implementation, or use parseDOM if the limited DOM implementation it gives you is enough.

Alternatively Neko.
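
The second half of the first route (re-parsing the tidied XHTML with the standard API) might look roughly like the following. This is only a sketch: it uses JDK classes alone and does not call JTidy or Neko, and the class name `XhtmlDomExample`, its methods, and the sample markup are made up for illustration. It assumes the XHTML is already well-formed and carries no DOCTYPE; with a DOCTYPE the parser may try to fetch the DTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library mentioned in this thread.
class XhtmlDomExample {

    // Parse a well-formed XHTML string into a standard org.w3c.dom.Document.
    // The factory is not namespace-aware by default, so plain names such as
    // "a" can be used in XPath expressions (assuming no xmlns handling is needed).
    static Document parse(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    // Collect the href value of every <a> element, in document order.
    static List<String> linkTargets(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hrefs = (NodeList) xpath.evaluate(
                "//a/@href", doc, XPathConstants.NODESET);
        List<String> targets = new ArrayList<>();
        for (int i = 0; i < hrefs.getLength(); i++) {
            targets.add(hrefs.item(i).getNodeValue());
        }
        return targets;
    }
}
```

Calling `XhtmlDomExample.linkTargets(XhtmlDomExample.parse(xhtml))` on the cleaned-up output would then give the link targets as plain strings, with no parser-specific API involved after the clean-up step.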

bobince
Neko + Xerces do the job quite well. Thanks to everyone who answered.
Stefan Teitge
Beware of JTidy. It has a memory leak bug. If you run it in a production system it will eventually blow up: StackOverflowError and eventually OutOfMemoryError. That said, it is wonderfully good at fixing broken HTML so that you can feed it into a DOM parser.
Joel
+1  A: 

TagSoup can do what you want.

Peter Štibraný
+1  A: 
Ichiro Furusato
A: 

I tried using JTidy to parse an HTML document, but even JTidy failed to parse it and ended up giving me a lot of warnings. Unfortunately, the HTML I am trying to extract data from with a Java program displays well in the browser but is too malformed even for JTidy. It is the standard format of the bank statement that my bank sends me every week, and I want to parse it. Does anyone know of a suitable parser for such a case, so that I can use a Java API to extract tags and their contents? I am interested in only one table tag in the entire HTML.
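
For the "only one table" part, once some tolerant parser (Neko and TagSoup are suggested above) has produced a standard DOM, the extraction itself could be sketched as below. This uses only the JDK's XPath API and does not solve the malformed-input problem; the table id `statement`, the class name `TableExtractExample`, and the sample markup are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
class TableExtractExample {

    // Parse already well-formed markup into a standard DOM document.
    static Document parse(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
    }

    // Return the trimmed cell text of every row in the table with the given id.
    static List<List<String>> tableRows(Document doc, String tableId) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The id is inserted literally, so it must not contain a single quote.
        NodeList rows = (NodeList) xpath.evaluate(
                "//table[@id='" + tableId + "']//tr", doc, XPathConstants.NODESET);
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < rows.getLength(); i++) {
            Node row = rows.item(i);
            // Header and data cells of this row, in document order.
            NodeList cells = (NodeList) xpath.evaluate(
                    "th|td", row, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int j = 0; j < cells.getLength(); j++) {
                values.add(cells.item(j).getTextContent().trim());
            }
            result.add(values);
        }
        return result;
    }
}
```

If the real table has no id, the XPath expression would need to select it some other way, for example by position or by a header cell's text.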

Vikasmlb