I am using the Xerces implementation within JDK6 to perform XPath queries on an HTML 4.0 transitional document. With the following code:
XPath newXPath = XPathFactory.newInstance().newXPath();
XPathExpression xpathExpr = newXPath.compile(expression);
Object xPathResult = xpathExpr.evaluate(inputSource, XPathConstants.NODESET);
Where inputSource is built from a FileInputStream, I receive the exception:
Caused by: org.xml.sax.SAXParseException: The entity "mdash" was referenced, but not declared.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:239)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:283)
    at com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:291)
This message is also printed to the standard output:
[Fatal Error] :20:43: The entity "mdash" was referenced, but not declared.
How can I avoid this exception?
The HTML file is created by an XSLT transformation from XML. I don't think I necessarily need it to be an &mdash;, but I'm not sure. The HTML is to be displayed in a Java Swing application.
It's difficult for me to judge what information from my specific implementation is relevant for this problem. Please let me know by comments if more information is needed.
So, I was under the misconception that HTML is XML (a result of not actually thinking about it at all).
So, given an HTML file, how do I go about solving this problem?
- Giving the parser the DTD for HTML 4?
- Replace &mdash; with the numeric equivalent? The HTML is created by an XSLT transform; can the stylesheet be set to replace mdash with the equivalent numeric character reference?
- Are there any libraries that would fix the HTML before it is given to the parser? I've noticed JTidy being mentioned for similar purposes.
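
To make the second option concrete, here is a minimal sketch of what I have in mind. It assumes the markup is otherwise well-formed XML and that &mdash; is the only undeclared entity in it; the class and method names (MdashXPathSketch, replaceMdash, select) are hypothetical and only for illustration. It swaps the named entity for the numeric character reference &#8212; before the XML parser sees the document, then runs the XPath query against the parsed DOM.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; the name and methods are illustrative only.
class MdashXPathSketch {

    // Replace the named entity with its numeric character reference (U+2014),
    // which an XML parser accepts without any DTD.
    static String replaceMdash(String markup) {
        return markup.replace("&mdash;", "&#8212;");
    }

    // Parse the cleaned markup as XML and evaluate the XPath expression on it.
    // Checked exceptions are wrapped so callers do not have to declare them.
    static NodeList select(String markup, String expression) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new InputSource(new StringReader(replaceMdash(markup))));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (NodeList) xpath.compile(expression)
                    .evaluate(doc, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the markup", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input: well-formed markup containing the HTML-only entity.
        String html = "<html><body><p>before &mdash; after</p></body></html>";
        NodeList paragraphs = select(html, "//p");
        System.out.println(paragraphs.getLength() + " paragraph(s): "
                + paragraphs.item(0).getTextContent());
    }
}
```

This only handles the one entity from the error message. Any other HTML-only entity (for example &nbsp;) would presumably fail in the same way, which is why I am also asking about supplying the DTD or cleaning the HTML with a library first.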