ansaurus

Question

Is it possible to parse an HTML document and build a DOM tree (Java)?

Answer 1

+1 A:

You can take a look at NekoHTML, a Java library that performs best-effort cleaning and tag balancing on your document. It is an easy way to parse a malformed HTML (or invalid XML) file.

It is distributed under the Apache 2.0 license.

Guido 2009-09-16 14:19:16

Answer 2

A:

HTML Parser seems to support conversion from HTML to XML. Then you can build a DOM tree using the usual Java toolchain.
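To illustrate the second step, here is a minimal sketch of the "usual Java toolchain" part, using only the JDK's built-in JAXP classes. It assumes the HTML has already been converted to well-formed XML; the class name `XhtmlToDom` and the sample markup are made up for illustration, and the conversion step itself is not shown.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XhtmlToDom {
    // Parses well-formed XML (e.g. the output of an HTML-to-XML converter)
    // into a W3C DOM Document using the JDK's default parser.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical converter output: already well-formed.
        String xml = "<html><body><p>Hello</p></body></html>";
        Document doc = parse(xml);
        System.out.println(doc.getDocumentElement().getTagName()); // prints "html"
    }
}
```

Note that the JDK parser rejects input that is not well-formed, which is why the conversion step has to come first.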

Johannes Weiß 2009-09-16 14:20:45

Answer 3

A:

There are several open source tools to parse HTML from Java.

Check http://java-source.net/open-source/html-parsers

You can also check the answers to this question, which is almost the same: http://stackoverflow.com/questions/457684/reading-html-file-to-dom-tree-using-java

JuanZe 2009-09-16 14:21:24

Answer 4

A:

JTidy should let you do what you want.

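Usage is fairly straightforward, and parsing is configurable. For example:

import java.io.InputStream;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.tidy.Tidy;

InputStream in = ...;
Tidy tidy = new Tidy();
// configure the Tidy instance as required
...
// a null output stream means the cleaned markup is not written anywhere
Document doc = tidy.parseDOM(in, null);
Element root = doc.getDocumentElement();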

The JavaDoc is hosted here.
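Once you have the root element, you work with it through the standard W3C DOM API, whichever parser produced it. As a rough sketch, the following walks a tree depth-first and collects the element names. The class `DomWalker` and the sample markup are invented for this example, and the JDK's own XML parser stands in for JTidy so that the snippet runs without extra libraries.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomWalker {
    // Depth-first traversal: records the name of every element node
    // in document order.
    static void collectTagNames(Node node, List<String> names) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            names.add(node.getNodeName());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectTagNames(child, names);
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parser's output; with JTidy the Document would
        // come from tidy.parseDOM(...) instead.
        String xml = "<html><head><title>t</title></head><body><p>a</p><p>b</p></body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        List<String> names = new ArrayList<>();
        collectTagNames(doc.getDocumentElement(), names);
        System.out.println(names); // prints [html, head, title, body, p, p]
    }
}
```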

Andy 2009-09-16 14:23:05

Answer 5

+3 A:

You can use TagSoup. It is a SAX-compliant parser that cleans malformed content, such as HTML from generic web pages, into well-formed XML. For example:

This is <B>bold, <I>bold italic, </b>italic, </i>normal text

gets correctly rewritten as:

This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
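Since a SAX parser only emits events, you still need something to assemble them into a DOM tree. One common way, sketched below, is an identity transform from a `SAXSource` to a `DOMResult`. The helper class `SaxToDom` is made up for this example, and the JDK's default `XMLReader` is used as a stand-in so the code runs without extra jars. The assumption is that TagSoup's `org.ccil.cowan.tagsoup.Parser`, which implements `XMLReader`, would be passed in its place for real-world HTML.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class SaxToDom {
    // Feeds the SAX events produced by `reader` through an identity
    // transform, which builds a DOM Document from them.
    static Document build(XMLReader reader, InputSource input) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        identity.transform(new SAXSource(reader, input), result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader: the JDK's own SAX parser, which needs
        // well-formed input. A TagSoup Parser instance would go here
        // to handle malformed HTML (assumption, not exercised below).
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        String xml = "<p>This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.</p>";
        Document doc = build(reader, new InputSource(new StringReader(xml)));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Keeping the reader as a parameter means the DOM-building code does not change when you swap parsers.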

Thiyagaraj 2009-09-16 14:49:38

TagSoup is very good, especially if you have to parse crappy HTML

Pascal Thivent 2009-09-16 14:59:26
