I am using NekoHTML. It fails to parse HTML from sites like mercurynews.com into a DOM. Is there a solution to this problem?

+3  A: 

Beyond switching to another parser? If the site has consistent error patterns, you could hot-fix them with a series of regexes before passing the markup on to the parser.
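A rough sketch of that pre-cleaning step, in case it helps. The two patterns here (stray NUL characters and unescaped ampersands) are only assumptions for illustration; the class name HtmlPreFixer is made up, and the real patterns would have to come from looking at what actually breaks on the site in question.

```java
import java.util.regex.Pattern;

public class HtmlPreFixer {
    // Hypothetical fix-ups: replace these with patterns matching the site's actual errors.
    // Removes NUL characters that sometimes end up in scraped pages.
    private static final Pattern STRAY_NUL = Pattern.compile("\\u0000");
    // Matches an '&' that does not start a named or numeric entity such as &amp; or &#169;.
    private static final Pattern BARE_AMP = Pattern.compile("&(?!#?\\w+;)");

    // Applies each fix in turn and returns the cleaned markup.
    public static String fix(String html) {
        String out = STRAY_NUL.matcher(html).replaceAll("");
        out = BARE_AMP.matcher(out).replaceAll("&amp;");
        return out;
    }

    public static void main(String[] args) {
        // prints <p>AT&amp;T &amp; friends &amp; &#169;</p>
        System.out.println(fix("<p>AT&T & friends &amp; &#169;</p>"));
    }
}
```

The output of fix() would then be handed to the parser in place of the raw page. This only works while the errors stay consistent, so it is a stopgap rather than a general answer.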

kd304
+4  A: 

Have you considered Tag Soup?

http://home.ccil.org/~cowan/XML/tagsoup/

Janie
A: 

You may consider using the Swing HTML parser.

http://www.rkcole.com/articles/swing/HTMLParser.html
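For reference, a minimal sketch of what using it looks like. Note that it is callback-based rather than DOM-based, so you get events instead of a tree; the class name SwingTextExtractor and the choice to collect only text are my own for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingTextExtractor {
    // Collects every run of text the parser reports, in document order.
    public static List<String> extractText(Reader reader) throws IOException {
        final List<String> chunks = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                chunks.add(new String(data));
            }
        };
        // The third argument tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(reader, callback, true);
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // Deliberately sloppy markup: the <b> tag is never closed.
        String html = "<html><body><p>Hello <b>world</p></body></html>";
        for (String chunk : extractText(new StringReader(html))) {
            System.out.println(chunk);
        }
    }
}
```

Other callbacks such as handleStartTag and handleEndTag can be overridden the same way if you need the element structure as well.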

Thorbjørn Ravn Andersen
A: 

I have used the Cobra renderer from the "Lobo Project" (http://lobobrowser.org/cobra.jsp) for parsing less-than-friendly HTML, and it has worked well. Its API is also very easy to use.

Hope this helps.

cjstehno
A: 

Use JTidy to tidy the HTML before parsing, or better yet, use it as the parser.

ykaganovich
I find that JTidy is slow, and it has not been maintained since 2000.
Lu
A: 

I don't know what "sites like" means, but MercuryNews.com and most news sites have an RSS interface.

ykaganovich
RSS only provides short snippets on most of these sites. I am interested in parsing the full articles, which are in HTML format.
Lu