I am using the Xerces implementation within JDK6 to perform XPath queries on an HTML 4.0 transitional document. With the following code:

XPath newXPath = XPathFactory.newInstance().newXPath();
XPathExpression xpathExpr = newXPath.compile(expression);
Object xPathResult = xpathExpr.evaluate(inputSource, XPathConstants.NODESET);

Where inputSource is built from a FileInputStream, I receive the exception:

Caused by: org.xml.sax.SAXParseException: The entity "mdash" was referenced, but not declared.  
 at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:239) 
 at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:283)
 at com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:291)  

This message is also printed to the standard output:

[Fatal Error] :20:43: The entity "mdash" was referenced, but not declared.

How can I avoid this exception?

The HTML file is created by an XSLT transformation from XML. I don't think I necessarily need it to be an &mdash;, but I'm not sure. The HTML is to be displayed in a Java Swing application.

It's difficult for me to judge what information from my specific implementation is relevant for this problem. Please let me know by comments if more information is needed.


So, I was under the misconception that HTML is XML (a result of never actually thinking about it).

So, given an HTML file, how do I go about solving this problem?

  • Giving the parser the DTD for HTML 4?
  • Replace &mdash; with the numeric equivalent? The HTML is created by an XSLT transform; can the stylesheet be set to emit the numeric character reference instead of &mdash;?
  • Are there any libraries that would fix the HTML before it's given to the parser? I've noticed JTidy being mentioned for similar purposes.
+1  A: 

Given that HTML is not XML I think you might find lots of problems with trying to parse HTML Transitional with an XML parser. But in case your HTML is actually well-formed XML, the mdash and other entities are usually defined in the DTD. Make sure your parser has the DTD for the document and it should be ok.
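If the parser cannot fetch the DTD over the network, one option is to hand it a local copy through an EntityResolver. What follows is only a rough sketch: the class name, the method name and the one-line LOCAL_DTD string are made up for illustration, and the string stands in for a DTD file saved to disk. It also assumes the document is well-formed XHTML, since an XML parser will not accept arbitrary HTML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class LocalDtdExample {
    // Hypothetical stand-in for a locally saved DTD; it declares only mdash.
    static final String LOCAL_DTD = "<!ENTITY mdash \"&#8212;\">";

    static String firstParagraph(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Answer every external entity request from the local copy, not the network.
        builder.setEntityResolver(new EntityResolver() {
            public InputSource resolveEntity(String publicId, String systemId) {
                return new InputSource(new StringReader(LOCAL_DTD));
            }
        });
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        // Evaluate the XPath against the parsed DOM rather than the raw input.
        return (String) XPathFactory.newInstance().newXPath()
                .evaluate("/html/body/p", doc, XPathConstants.STRING);
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<html><body><p>a&mdash;b</p></body></html>";
        System.out.println(firstParagraph(xhtml));
    }
}
```

With a real saved DTD the resolver would probably need to look at systemId and return the matching local file, because the XHTML DTDs pull in further entity files that also pass through the resolver.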

Mr. Shiny and New
+1 for reminding me HTML is not XML ;-)
Grundlefleck
+1  A: 

The problem is that if the document as presented to Xerces does not have a DTD with mdash declared it is not a well-formed XML document - all entities have to be declared. HTML has a set of "builtin" entities that HTML processors need to know about and these should be in a DTD.

The simplest workaround without a DTD will be to replace &mdash; by its numeric equivalent (&#8212; or &#x2014;).
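As a rough illustration of that workaround, the named entity can be swapped for the numeric reference before the text reaches the parser. The class and method names here are made up, and the sketch assumes the markup is otherwise well-formed and that mdash is the only undeclared entity in it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NumericEntityExample {
    static String firstParagraph(String markup) throws Exception {
        // mdash is U+2014 (decimal 8212); numeric references need no DTD.
        String patched = markup.replace("&mdash;", "&#8212;");
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(patched)));
        return (String) XPathFactory.newInstance().newXPath()
                .evaluate("/html/body/p", doc, XPathConstants.STRING);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstParagraph("<html><body><p>a&mdash;b</p></body></html>"));
    }
}
```

A plain string replace like this only covers the entities you list by hand, so it is a stopgap and not a general fix.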

peter.murray.rust
Would this problem occur with XHTML?
Grundlefleck
The only place the entity can be defined is in the DTD. So if the file has a DOCTYPE with the DTD, Xerces should retrieve the entity from there.
peter.murray.rust
An added problem is I'm working behind the university proxy, and the lookup times out. I'd prefer not to mess with proxies, is it possible to supply a local DTD file to Xerces, and I can just save the relevant one from w3c.org?
Grundlefleck
Btw, +1 for entities information :-)
Grundlefleck
+2  A: 

The right DTD in the header of your file should contain all the necessary entity declarations, and if your file is well-formed then the parser will honor this information.

If there's a chance that the HTML is not well-formed, I'd recommend the TagSoup library for fixing this. It reads the input and tries to produce valid XHTML as output, never reporting any parse errors, just trying to fix them using built-in heuristics. I was able to successfully process very broken HTML from the web and perform XPath queries over it (seems like this is what you need).

Sergey Mikhanov
+2  A: 

I think I found the problem for my specific situation. The HTML file was generated from an XML file using XSLT. By changing the line:

<xsl:output method="html" />

to:

<xsl:output method="xml" />

The transformation did not create the &mdash; entity. The output file could then be parsed using Xerces.

I'm not sure if this is "correct", but it seems to do the trick for displaying in Swing.
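For anyone wanting to try this, here is a small sketch using the JDK's javax.xml.transform API. The stylesheet, the input document and the class name are invented for illustration and are not my real files; the point is only that with method="xml" the output contains no &mdash; reference and can be parsed again as XML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XmlOutputExample {
    // Hypothetical stylesheet: wraps the text of <doc> in a minimal HTML skeleton.
    static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"xml\"/>"
            + "<xsl:template match=\"/\"><html><body><p><xsl:value-of select=\"/doc\"/></p>"
            + "</body></html></xsl:template></xsl:stylesheet>";

    static String transform(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    static String firstParagraph(String transformed) throws Exception {
        // Re-parse the transformer output with the XML parser and query it.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(transformed)));
        return (String) XPathFactory.newInstance().newXPath()
                .evaluate("/html/body/p", doc, XPathConstants.STRING);
    }

    public static void main(String[] args) throws Exception {
        String result = transform("<doc>a&#8212;b</doc>");
        System.out.println(result.contains("&mdash;")); // named entity should be absent
        System.out.println(firstParagraph(result));
    }
}
```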

Grundlefleck
Too bad I saw this a bit late. Yes, it's an appropriate and correct solution too. If you can generate "clean" XML you don't need JTidy.
Carl Smotricz
+1  A: 
Carl Smotricz