How to handle/parse ill-formed html into DOM in java?

I am using NekoHTML. It fails to parse the HTML from sites like mercurynews.com into a DOM. Is there any solution to this problem?

+3 A:

Beyond switching to another parser? If the site has consistent error patterns, you could hot-fix them with a series of regexes before passing the markup on to the parser.
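A minimal sketch of that hot-fix idea, assuming two hypothetical error patterns (bare `&` characters and unclosed `<br>` tags); a real site would need its own list. The class and method names are made up for illustration, and the JDK's XML parser stands in for NekoHTML so the example has no dependencies. Note that it still rejects anything the regexes did not repair, including named entities such as `&nbsp;`.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class RegexHotFix {
    // Hypothetical pattern 1: an '&' that does not start an entity reference.
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Hypothetical pattern 2: a <br> tag that is never closed.
    private static final Pattern OPEN_BR =
            Pattern.compile("<br\\s*>", Pattern.CASE_INSENSITIVE);

    /** Applies each known fix in turn and returns the repaired markup. */
    public static String hotFix(String html) {
        String fixed = BARE_AMP.matcher(html).replaceAll("&amp;");
        return OPEN_BR.matcher(fixed).replaceAll("<br/>");
    }

    /** Parses the repaired markup; stand-in for handing it to the real HTML parser. */
    public static Document parse(String markup) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        String broken = "<html><body><p>News & views<br>More</p></body></html>";
        Document doc = parse(hotFix(broken));
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Keeping each fix as its own named pattern makes it easy to add or drop one when the site changes its markup.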

kd304 2009-07-14 19:10:42

+4 A:

Have you considered Tag Soup?

http://home.ccil.org/~cowan/XML/tagsoup/

Janie 2009-07-14 19:13:32

You may consider using the Swing HTML parser.

http://www.rkcole.com/articles/swing/HTMLParser.html
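The Swing parser reports tags and text through callbacks rather than returning a DOM, so some glue code is needed. The following is a rough sketch of one way to do that, assuming a simple stack of open elements is enough; `SwingHtmlToDom` and its `append` helper are hypothetical names, attributes are ignored, and tags the parser does not recognise arrive as simple tags and end up as empty elements.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class SwingHtmlToDom {

    /** Builds a W3C DOM from the callbacks of the Swing HTML parser. */
    public static Document parse(String html) throws Exception {
        final Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        final Deque<Node> open = new ArrayDeque<>(); // stack of open nodes
        open.push(doc);

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                Element element = doc.createElement(tag.toString());
                append(open, element);
                open.push(element);
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                // Assumes end tags arrive in nesting order; never pops the document.
                if (open.size() > 1) {
                    open.pop();
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                append(open, doc.createElement(tag.toString()));
            }

            @Override
            public void handleText(char[] data, int pos) {
                append(open, doc.createTextNode(new String(data)));
            }
        };

        // 'true' tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return doc;
    }

    /** Appends a child, skipping nodes that a Document node cannot hold. */
    private static void append(Deque<Node> open, Node child) {
        Node parent = open.peek();
        if (parent instanceof Document
                && (child.getNodeType() != Node.ELEMENT_NODE
                    || ((Document) parent).getDocumentElement() != null)) {
            return; // a Document takes one root element and no text
        }
        parent.appendChild(child);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body><p>First<p>Second <b>bold</body>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

The parser follows an old HTML DTD, so how it repairs a given page may differ from what a browser does.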

Thorbjørn Ravn Andersen 2009-07-14 19:20:18

I have used the Cobra renderer from the "Lobo Project" (http://lobobrowser.org/cobra.jsp) for parsing less-than-friendly HTML, and it has worked well. Its API is also very easy to use.

Hope this helps.

cjstehno 2009-07-14 20:49:41

Use JTidy to tidy the HTML before parsing, or better yet, use JTidy itself as the parser.

ykaganovich 2009-07-14 21:57:28

I find that JTidy is slow, and it has not been maintained since 2000.

Lu 2009-07-14 23:18:04

I don't know what "sites like" means, but MercuryNews.com and most news sites have an RSS interface.

ykaganovich 2009-07-14 22:03:14

On most sites, RSS only provides short snippets. I am interested in parsing the full articles, which are in HTML.

Lu 2009-07-14 23:19:31
