I want to reference a URL containing character entity declarations using C#/.NET on an XmlReader instance, for example this W3C entity set defining &nbsp; and other chars.

If I were to accomplish it in pure XML it would be something like this, or a variation:
<!ENTITY foo SYSTEM "http://example.org/myent.ent">

I'm actually reading fragments of XHTML source (containing named entities), so I need the XML 1.0/HTML 4 named entity sets defined by the W3C to be declared and recognized.
(I suppose I'm asking how to programmatically reference them on-the-fly while setting up the XmlReader and its Settings for reading fragments; however I'm open to options).

Either way, if I don't include these named entities the reader will cough and produce .NET errors such as the following XmlException for &nbsp; and other non-numeric entities:

Test 'Xml_Tester.Test_Reading' failed: System.Xml.XmlException : Reference to undeclared entity 'nbsp'. Line 6, position 393.
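
For reference, the failure can be reproduced in a few lines. This is only a minimal sketch: the names UndeclaredEntityRepro and TryRead are made up for illustration, and it assumes fragment conformance so that no DOCTYPE is involved at all.

```csharp
using System;
using System.IO;
using System.Xml;

static class UndeclaredEntityRepro
{
    // Reads the fragment to the end. Returns null on success, or the
    // XmlException message if the reader rejects the input.
    public static string TryRead(string fragment)
    {
        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment
        };
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(fragment), settings))
            {
                while (reader.Read()) { }
            }
            return null;
        }
        catch (XmlException ex)
        {
            return ex.Message;
        }
    }
}
```

A numeric reference such as &#160; reads cleanly, while the named form &nbsp; yields the "undeclared entity" message, because nothing has declared nbsp for the reader.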

Note: I'm successfully referencing an XHTML Schema using the XmlReaderSettings.Schemas collection property, and assume there has to be an equally easy way to call in external entity references without modifying the XML source, but it evades me.


Etc:

I came across the following useful bits of information while searching for an answer; they're likely relevant here:

Support for entities
To use entities, authors have to use the DTD mechanism. See section 1.5 to use DTD and XML Schema together. -- http://www.w3.org/TR/xhtml1-schema/#diffs

1.5. Using DTD and XML Schema together
DTD validation and XML Schema validation are not mutually exclusive. Sometimes authors might want to use some DTD features (e.g. entities) while taking advantage of the XML Schema validation. -- http://www.w3.org/TR/xhtml1-schema/#together

Combining XML Documents with XInclude
External entities must be declared in a DTD or an internal subset. This opens a Pandora's Box full of implications, such as the fact that the document element must be named in Doctype declaration and that validating readers may require that the full content model of the document be defined in DTD among others.
-- http://msdn.microsoft.com/en-us/library/aa302291.aspx#xinc_topic1

A: 

I found a way to use an XmlReader instance to read XHTML source that includes named entities like &nbsp; without it throwing an XmlException.

To start, I copied the following XML sample directly from the W3C page "XHTML 1.0 in XML Schema", section 1.5, "Using DTD and XML Schema together". It brings in the named character entities through the DTD while still allowing Schema-based validation:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"[
<!ATTLIST html
    xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation CDATA #IMPLIED
>
]>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/1999/xhtml
                          http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd">
  ...
</html>

I then substitute XHTML fragments, e.g. <body><div><b>xhtml stuff</b></div></body>, in place of the ... in the above sample.

This successfully mixes a DTD (to declare the named entities) with Schema validation. The XmlReader no longer throws an XmlException when it encounters a named entity.
Success!


The C#.NET code that processes the above sample

using System;
using System.IO;
using System.Xml;

The core logic follows. Note: this is copied and pasted verbatim from my code. Some of the settings may be unnecessary or redundant, so adjust them to suit your needs.

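XmlReaderSettings settingsXRdr = new XmlReaderSettings();
// Allow the DOCTYPE to be parsed. On .NET 4.0 and later ProhibitDtd is obsolete;
// use settingsXRdr.DtdProcessing = DtdProcessing.Parse instead.
settingsXRdr.ProhibitDtd = false;
settingsXRdr.CheckCharacters = true;
settingsXRdr.ConformanceLevel = ConformanceLevel.Document;
settingsXRdr.IgnoreProcessingInstructions = false;
settingsXRdr.IgnoreComments = false;
// CustomXmlResolver is my own XmlResolver subclass (not shown here).
settingsXRdr.XmlResolver = new CustomXmlResolver();
// Validate against the DTD named in the DOCTYPE.
settingsXRdr.ValidationType = ValidationType.DTD;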

// This is a format string; notice the placeholder {0} where the fragment will be injected:

string mixFmtString1 = @"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Strict//EN"" ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd""[
<!ATTLIST html
xmlns:xsi CDATA #FIXED ""http://www.w3.org/2001/XMLSchema-instance""
xsi:schemaLocation CDATA #IMPLIED
>
]>
<html xmlns=""http://www.w3.org/1999/xhtml"" lang=""en"" xml:lang=""en""
xmlns:xsi=""http://www.w3.org/2001/XMLSchema-instance""
xsi:schemaLocation=""http://www.w3.org/1999/xhtml
              http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd"">
<head><title></title></head>
<body>
<div>{0}</div>
</body>
</html>";


// Inject any well-formed fragment via the second argument
string xhtml = string.Format(mixFmtString1, "<b>Xhtml fragment w/named entity: &nbsp;</b>");

// Creates a validating reader (a derived type) because of the above settings
XmlReader rdr = XmlReader.Create(new StringReader(xhtml), settingsXRdr);

// Reads the entire XHTML document (validating it along the way).
while (rdr.Read()) {

    // Do whatever you want here for each piece processed.
    var dummy = rdr.NodeType.ToString();  // Access a string value for fun.
    // If you just want validation to occur then leave this an empty code block. 

}
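
The CustomXmlResolver class assigned to XmlResolver above is not shown in this answer. As one possible stand-in (an assumption, not the original class), .NET 4.0 and later ship System.Xml.Resolvers.XmlPreloadedResolver, which carries built-in copies of the XHTML 1.0 DTDs and their entity files, so the reader does not have to download them from w3.org. The sketch below also uses DtdProcessing.Parse, the newer replacement for ProhibitDtd = false. The names XhtmlEntityReader and ReadText are hypothetical, and DTD validation is left off to keep the example small.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Resolvers;

static class XhtmlEntityReader
{
    // Same idea as the format string above, reduced to the DOCTYPE and a bare wrapper.
    const string Wrapper =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" " +
        "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">" +
        "<html xmlns=\"http://www.w3.org/1999/xhtml\">" +
        "<head><title></title></head><body><div>{0}</div></body></html>";

    // Wraps the fragment, reads it, and returns the concatenated text nodes
    // with named entities expanded by the reader.
    public static string ReadText(string fragment)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse,
            // Serves the XHTML 1.0 DTDs and entity sets from memory.
            XmlResolver = new XmlPreloadedResolver(XmlKnownDtds.Xhtml10)
        };
        var text = new StringBuilder();
        string document = string.Format(Wrapper, fragment);
        using (XmlReader reader = XmlReader.Create(new StringReader(document), settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text)
                {
                    text.Append(reader.Value);
                }
            }
        }
        return text.ToString();
    }
}
```

With this in place, a call such as XhtmlEntityReader.ReadText("<b>a&nbsp;b</b>") should return the text with a U+00A0 no-break space where the entity was, with no network access needed.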

Note: This solution uses the XHTML 1.0 Strict DTD, so deprecated tags such as <center> will fail validation. If you need them, point the DOCTYPE (and the schema location) at the more forgiving Transitional (loose) DTD instead.
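
As a rough sketch of switching DTDs, the wrapper document can be built from a flavor name. XhtmlWrapper and Build are hypothetical names; the Transitional public identifier and DTD URL are the ones the W3C publishes for XHTML 1.0. The xsi:schemaLocation part of the original template is left out here for brevity.

```csharp
using System;

static class XhtmlWrapper
{
    // flavor must be "strict" or "transitional"; anything else is rejected.
    public static string Build(string flavor, string fragment)
    {
        string publicName;
        switch (flavor)
        {
            case "strict": publicName = "Strict"; break;
            case "transitional": publicName = "Transitional"; break;
            default: throw new ArgumentException("Unknown XHTML flavor: " + flavor, "flavor");
        }
        // The DOCTYPE selects which DTD (and therefore which tags) the reader accepts.
        return
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 " + publicName + "//EN\" " +
            "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-" + flavor + ".dtd\">\n" +
            "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\" xml:lang=\"en\">\n" +
            "<head><title></title></head>\n" +
            "<body><div>" + fragment + "</div></body>\n" +
            "</html>";
    }
}
```

The string returned by XhtmlWrapper.Build("transitional", fragment) can be passed to XmlReader.Create in place of xhtml; the rest of the code stays the same. If you keep the xsi:schemaLocation attribute from the original template, it would need to point at the Transitional schema as well.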

Related/useful resources from along the way:

John K