I am using NekoHTML. It fails to parse HTML from sites like mercurynews.com into a DOM. Is there a solution to this problem?
+3
A:
Beyond switching to another parser? If the site has consistent error patterns, you could hotfix them with a series of regex replacements before passing the markup on to the parser.
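A minimal sketch of that kind of regex pre-pass, using only java.util.regex. The class name HtmlHotfix and both patterns are hypothetical illustrations, not errors actually observed on mercurynews.com; you would replace them with whatever breakage you find in the real pages.

```java
import java.util.regex.Pattern;

// Hypothetical helper: applies a fixed, ordered list of regex repairs to raw
// HTML before it is handed to a parser such as NekoHTML.
final class HtmlHotfix {
    // Made-up example patterns; substitute the site's real error patterns.
    private static final Pattern[] PATTERNS = {
        // A tag followed by stray extra closing brackets, e.g. "<p>>".
        Pattern.compile("(<[a-zA-Z][^<>]*>)>+"),
        // A bare "&" that does not start a named or numeric entity reference.
        Pattern.compile("&(?![a-zA-Z][a-zA-Z0-9]*;|#\\d+;|#x[0-9a-fA-F]+;)")
    };

    // Replacement for the pattern at the same index.
    private static final String[] REPLACEMENTS = {
        "$1",     // keep the tag, drop the stray brackets
        "&amp;"   // escape the bare ampersand
    };

    private HtmlHotfix() {}

    // Runs every repair in order and returns the patched markup.
    static String clean(String html) {
        String result = html;
        for (int i = 0; i < PATTERNS.length; i++) {
            result = PATTERNS[i].matcher(result).replaceAll(REPLACEMENTS[i]);
        }
        return result;
    }
}
```

You would call HtmlHotfix.clean(rawHtml) on the downloaded page and feed the returned string to the parser. Blanket patterns like these can also touch inline scripts, so this only makes sense when the errors really are consistent across the site.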
kd304
2009-07-14 19:10:42
A:
I have used the Cobra renderer from the "Lobo Project" (http://lobobrowser.org/cobra.jsp) for parsing less-than-friendly HTML, and it has worked well. Its API is also very easy to use.
Hope this helps.
cjstehno
2009-07-14 20:49:41
A:
Use JTidy to tidy the HTML before parsing, or better yet, use JTidy itself as the parser.
ykaganovich
2009-07-14 21:57:28
I find that JTidy is slow, and it has not been maintained since 2000.
Lu
2009-07-14 23:18:04
A:
I don't know what "sites like" means, but MercuryNews.com and most news sites have an RSS interface.
ykaganovich
2009-07-14 22:03:14
On most of these sites, RSS provides only short snippets. I am interested in parsing the full articles, which are in HTML format.
Lu
2009-07-14 23:19:31