I'm using Saxon 9 to analyze invalid HTML sources. Specifically, the HTML has href values like the following:

<a href="blah.asp?fn=view&g_varID=1234">some text</a>

I'm getting errors:

"Error reported by XML parser: The reference to entity "g_varID" must end with the ';' delimiter."

The XML parser is reading the "&g_varID" string and complaining that there should be a ";" to delimit the entity reference. But, of course, this is not intended as an HTML entity -- it's just a piece of a URI.
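
The same rejection can be reproduced outside Saxon. The sketch below is only an illustration and assumes Python's standard-library xml.etree.ElementTree in place of Saxon's parser; the helper name try_parse is hypothetical, and the error text Python prints differs from the Saxon message quoted above.

```python
import xml.etree.ElementTree as ET

# The anchor from the question, with a bare "&" inside the href value.
snippet = '<a href="blah.asp?fn=view&g_varID=1234">some text</a>'


def try_parse(markup):
    """Return the parsed element, or None if the markup is not well-formed XML.

    Hypothetical helper for illustration; it is not part of Saxon or of any library.
    """
    try:
        return ET.fromstring(markup)
    except ET.ParseError as err:
        # The parser sees "&g_varID" as the start of an entity reference
        # and gives up when no terminating ";" follows.
        print("XML parser rejected the input:", err)
        return None


result = try_parse(snippet)
```

Any conforming XML parser should refuse this input in a similar way, since a bare "&" in an attribute value breaks well-formedness rather than validity.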

How can I tell the parser to ignore it? Note: I'm using non-schema-aware Saxon, not Saxon-SA.

Thanks

+2  A: 

If your HTML is not XML, then how do you expect any XML processor to process it?

John Saunders
Right, of course. It is invalid and so not xml. But it's well-formed. I guess my more general question is, "can I tell the processor to relax validation enough to get past this string?"
John Turnbull
The processor is processing XML. What do you mean it's well formed but not XML? If you want to process HTML that is not XML, then you use an HTML processor, not an XML processor.
John Saunders
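
As a rough sketch of the "use an HTML processor" suggestion, the example below assumes Python's standard-library html.parser module rather than anything from Saxon; the class name HrefCollector is made up for illustration. A lenient HTML parser of this kind accepts the bare "&" that an XML parser rejects.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


collector = HrefCollector()
collector.feed('<a href="blah.asp?fn=view&g_varID=1234">some text</a>')
collector.close()
print(collector.hrefs)
```

The collected values could then be re-serialized as well-formed XML before being handed to an XSLT processor, if that is still the goal.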
There's a useful distinction between XML that is valid and XML that is only well-formed. I was being clumsy. As Jeff Mc suggested, the solution was in the use of a Doctype. But as is so often the case, the "html" is so far from valid that processing it is a waste of time. Thanks.
John Turnbull
A: 

Make sure you have a correct xhtml DOCTYPE. According to the xhtml1-strict.dtd that I'm looking at, the href attribute is declared CDATA, not PCDATA, which means literal & is perfectly ok and should not be parsed as an entity.

Jeff Mc
In XML, string-typed attributes ("CDATA attributes") can contain entity references. (CDATA sections cannot, but they are a different thing). There's even a specific warning about ampersands in attribute values in an appendix to the XHTML spec: http://www.w3.org/TR/xhtml1/#C_12
Jukka Matilainen
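
To illustrate the point about ampersands in attribute values, here is a minimal sketch, assuming Python's standard library (xml.sax.saxutils.quoteattr and xml.etree.ElementTree) rather than Saxon. It shows that writing the ampersand as &amp; makes the attribute parse, and that the parser hands back the original URI with a literal "&".

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

raw_url = "blah.asp?fn=view&g_varID=1234"

# quoteattr escapes "&" as "&amp;" and wraps the value in quotes.
escaped_attr = quoteattr(raw_url)
markup = "<a href=" + escaped_attr + ">some text</a>"

# With the ampersand escaped, the markup is well-formed and parses cleanly.
link = ET.fromstring(markup)

# The parser resolves "&amp;" back to "&", so the URI is unchanged.
parsed_url = link.get("href")
print(markup)
print(parsed_url)
```

This only helps when the markup can be fixed at the source or in a preprocessing step; it does not make a parser accept the unescaped form.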
This confusion over "CDATA" originates from the SGML era. There seems to be a good summary here: http://www.flightlab.com/~joe/sgml/cdata.html
Jukka Matilainen