Is there a validating HTML parser implemented in Java?

+2 Q:

I need to parse HTML 4 in Java. Ideally I'd like an implementation that is SAX compatible.

I'm aware that there are numerous HTML parsers for Java; however, they all seem to perform 'tidying'. In other words, they correct badly formed HTML. I don't want this.

My requirements are:

No tidying.
If the input document is invalid HTML, parsing should fail.
The document should be validatable against the HTML DTDs.
The parser can produce SAX2 events.

Is there a library that meets these requirements?

+1 A:

You can find a collection of HTML parsers here: HTML Parsers. I don't remember exactly, but I think TagSoup parses the file without applying corrections...

adrian.tarau 2009-05-24 18:16:54

"TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild..." Unfortunately not.

johnstok 2009-05-24 18:24:54

"It does guarantee well-structured results: tags will wind up properly nested, default attributes will appear appropriately, and so on."

adrian.tarau 2009-05-24 18:59:16

If it is able to populate default attributes, this means it parses the DTD... but it's not clear whether it fails when the document does not validate.

adrian.tarau 2009-05-24 19:10:30

Also have a look at javax.swing.text.html.parser.Parser; it looks like it does DTD validation:

    protected void endTag(boolean omitted) {
        handleText(stack.tag);
        if (omitted [...]
        } else if (!stack.terminate()) {
            error("end.unexpected", stack.elem.getName());
        }

adrian.tarau 2009-05-24 19:14:56

+1 A:

You may wish to check http://lobobrowser.org/cobra.jsp. The project implements a pure Java web browser (Lobo), and its parser component (Cobra) is pulled out separately for standalone use. I honestly am not sure whether it will meet your "no tidying" requirement, but it may be worth a look. I ran across it while looking for a pure Java web browser.

monceaux 2009-05-25 08:34:36

You can try to subclass javax.swing.text.html.parser.Parser and implement the handleXXX() methods. It seems it doesn't try to fix the HTML. See the API documentation for more.

David Rabinowitz 2009-05-25 10:12:10
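
For anyone who wants to try the Swing parser route suggested above, here is a minimal sketch. It does not subclass Parser directly; it goes through ParserDelegator and overrides handleError() on HTMLEditorKit.ParserCallback, which is a swapped-in but simpler way to reach the same error reports. StrictHtmlCheck, collectErrors and parseOrFail are hypothetical names made up for this example. Assumptions to keep in mind: the DTD bundled with the JDK is HTML 3.2, not HTML 4; the parser still recovers internally and keeps going after an error; and no SAX2 events are produced. So this only approximates the "fail on invalid input" requirement and does not cover the others.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of the JDK or any library.
class StrictHtmlCheck {

    /** Parses the HTML and returns every error message the Swing parser reported. */
    static List<String> collectErrors(String html) throws IOException {
        final List<String> errors = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleError(String errorMsg, int pos) {
                // Record the character position together with the parser's message.
                errors.add(pos + ": " + errorMsg);
            }
        };
        // ParserDelegator uses the JDK's bundled default DTD (HTML 3.2, an assumption
        // that does not match the HTML 4 requirement in the question).
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return errors;
    }

    /** Treats any reported error as fatal, instead of accepting the repaired document. */
    static void parseOrFail(String html) throws IOException {
        List<String> errors = collectErrors(html);
        if (!errors.isEmpty()) {
            throw new IOException("Invalid HTML: " + errors);
        }
    }
}
```

Calling StrictHtmlCheck.parseOrFail(someHtml) throws as soon as the parser has reported at least one problem, for example an element name that is not in the DTD. Whether the set of reported errors is strict enough to count as real DTD validation would need to be checked against your own documents.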
