I have been using Cobra until now because of how easy it is, but unfortunately it has problems with a few of my test cases. Can anyone suggest a tried-and-tested library?
I've tried Cobra's built-in one and HTMLCleaner without any luck.
Take a look at Saxon (no, I'm not involved in any way with the product, just a satisfied user).
Mozilla HTML Parser looks rather interesting. By definition, it's supposed to be as good as the Gecko engine itself, which is likely to cover your needs.
TagSoup is really great when dealing with crappy HTML/XHTML.
Jericho (and NekoHTML) are also good at parsing invalid HTML.
TagSoup and Jericho: tried-and-tested. NekoHTML: feedback from a trusted source.
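For what it's worth, TagSoup exposes itself as a standard SAX `XMLReader`, so code written against the JDK's SAX interfaces can be pointed at it without changes. Here is a minimal sketch of that idea: a link collector that works with any `XMLReader`. The class name `LinkExtractor` and the method `extractLinks` are made up for illustration and are not part of TagSoup or any other library mentioned here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: collects href values of <a> elements from SAX events. */
final class LinkExtractor {

    /**
     * Parses the markup with the supplied reader and returns all href values
     * found on <a> elements, in document order.
     */
    static List<String> extractLinks(XMLReader reader, String html)
            throws IOException, SAXException {
        final List<String> links = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Namespace-aware readers fill in localName; others only qName.
                String name = (localName == null || localName.isEmpty())
                        ? qName : localName;
                if ("a".equalsIgnoreCase(name)) {
                    String href = atts.getValue("href");
                    if (href != null) {
                        links.add(href);
                    }
                }
            }
        });
        reader.parse(new InputSource(new StringReader(html)));
        return links;
    }
}
```

With the TagSoup jar on the classpath, the call would look like `LinkExtractor.extractLinks(new org.ccil.cowan.tagsoup.Parser(), messyHtml)`. The JDK's own reader (from `SAXParserFactory`) can be passed in the same way, but it only accepts well-formed markup, which is exactly the limitation a lenient parser is meant to remove.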
[Answering the title - the overall question and comments are not consistent]
JTidy (http://jtidy.sourceforge.net/) is a port of Dave Raggett's HTMLTidy. It's very useful though I think development may have slowed/ceased.
I suggest Validator.nu's parser, based on the HTML5 parsing algorithm. (Mozilla is currently in the process of replacing its own HTML parser with this one.)