I need to parse HTML 4 in Java. Ideally I'd like an implementation that is SAX-compatible.
I'm aware that there are numerous HTML parsers for Java, but they all seem to perform 'tidying': they silently correct badly formed HTML. I don't want this.
My requirements are:
- No tidying.
- If the input document is invalid HTML, parsing should fail.
- The document should be validatable against the HTML DTDs.
- The parser can produce SAX2 events.
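
To make the requirements concrete, this is roughly the behaviour I'm after, sketched with the standard JAXP SAX API. It is only an illustration: `StrictParse`, `FailFast` and `isValid` are names I made up, and the JDK's built-in parser is an XML parser, so this works for XML/XHTML input with a DTD but not for SGML-based HTML 4.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class StrictParse {

    /** Hypothetical handler: turns every reported error into a parse failure. */
    static final class FailFast implements ErrorHandler {
        @Override
        public void warning(SAXParseException e) {
            // Warnings are not validity errors, so they are ignored here.
        }

        @Override
        public void error(SAXParseException e) throws SAXException {
            throw e; // DTD validity errors are recoverable by default; make them fatal
        }

        @Override
        public void fatalError(SAXParseException e) throws SAXException {
            throw e; // well-formedness errors
        }
    }

    /** Returns true only if the document parses and validates against its DTD. */
    static boolean isValid(String document) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // enable DTD validation
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler()); // SAX2 events arrive here
            reader.setErrorHandler(new FailFast());
            reader.parse(new InputSource(new StringReader(document)));
            return true;
        } catch (Exception e) {
            return false; // no repair is attempted, the parse simply fails
        }
    }
}
```

What I'm looking for is a library that gives me the same fail-instead-of-fix behaviour, but for HTML 4 input validated against the HTML DTDs.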
Is there a library that meets these requirements?