How can I resolve all entity references in an XHTML document and convert it to a plain XHTML document that IE can understand? An example XHTML document:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html [
    <!ENTITY D "&#x2014;">
    <!ENTITY o "&#x2018;">
    <!ENTITY c "&#x2019;">
    <!ENTITY O "&#x201C;">
    <!ENTITY C "&#x201D;">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    </head>
    <body>
        &O; &C;
    </body>
</html>
A: 

It turns out this is a simple option on the XmlTextReader (and XmlValidatingReader) class: "EntityHandling".

So a simple demo of your problem:

System.Xml.XmlTextReader textReader = new System.Xml.XmlTextReader("testin.xml");
textReader.EntityHandling = System.Xml.EntityHandling.ExpandEntities;
System.Xml.XmlDocument outputDoc = new System.Xml.XmlDocument();
outputDoc.Load(textReader);
System.Xml.XmlDocumentType docTypeIfPresent = outputDoc.DocumentType;
if (docTypeIfPresent != null)
    outputDoc.RemoveChild(docTypeIfPresent);
outputDoc.Save("testout.html");
textReader.Close();

And if you prefer not to have to load the document into memory, a streaming equivalent:

System.Xml.XmlTextReader textReader = new System.Xml.XmlTextReader("testin.xml");
textReader.EntityHandling = System.Xml.EntityHandling.ExpandEntities;
System.Xml.XmlTextWriter textWriter = new System.Xml.XmlTextWriter("testout.html", System.Text.Encoding.UTF8);
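// WriteNode and Skip both advance the reader to the next node themselves,
// so position it once and loop until EOF instead of calling Read() on every pass
// (an extra Read() would silently drop the node that follows each copied one).
textReader.Read();
while (!textReader.EOF)
{
    if (textReader.NodeType != System.Xml.XmlNodeType.DocumentType)
        textWriter.WriteNode(textReader, false);
    else
        textReader.Skip();
}
textWriter.Close();
textReader.Close();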
Tao
XmlWriterSettings writerSettings = new XmlWriterSettings();
writerSettings.OmitXmlDeclaration = true;
XmlWriter xmlWriter = XmlWriter.Create(htmlFileName, writerSettings);
outputDoc.Save(xmlWriter);
xmlWriter.Close();
Priyank Bolia
Hi, I don't understand the comment - does OmitXmlDeclaration also omit the DTD? Wouldn't it have the undesirable side-effect of also actually removing the XML declaration? (which in turn could cause encoding issues)
Tao
Replace the line outputDoc.Save("testout.html"); with my code so that the XML declaration is omitted, which produces plain HTML instead of XML.
Priyank Bolia
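
For reference, the answer and the comment above can be combined into one self-contained sketch. This is only an illustration under some assumptions: the class name ExpandEntitiesDemo, the helper ToPlainXhtml, and the Sample string are made up for this example, the input is read from a string rather than from testin.xml, and it relies on XmlTextReader processing the internal DTD subset by default.

```csharp
using System;
using System.IO;
using System.Xml;

public static class ExpandEntitiesDemo
{
    // Hypothetical sample input, trimmed from the question's document.
    public const string Sample = @"<?xml version=""1.0"" encoding=""utf-8""?>
<!DOCTYPE html [
    <!ENTITY O ""&#x201C;"">
    <!ENTITY C ""&#x201D;"">
]>
<html xmlns=""http://www.w3.org/1999/xhtml"">
    <head>
    </head>
    <body>
        &O; &C;
    </body>
</html>";

    // Hypothetical helper: expands entities, then drops the DOCTYPE
    // and the XML declaration before serializing the document.
    public static string ToPlainXhtml(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        using (StringReader input = new StringReader(xhtml))
        using (XmlTextReader textReader = new XmlTextReader(input))
        {
            // Replace &O; and &C; with their character values while loading.
            textReader.EntityHandling = EntityHandling.ExpandEntities;
            doc.Load(textReader);
        }

        // Remove the DOCTYPE (and with it the entity declarations).
        XmlDocumentType docType = doc.DocumentType;
        if (docType != null)
            doc.RemoveChild(docType);

        // Omit the <?xml ...?> declaration, as suggested in the comment.
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.OmitXmlDeclaration = true;

        StringWriter output = new StringWriter();
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            doc.Save(writer);
        }
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(ToPlainXhtml(Sample));
    }
}
```

When writing to a file instead, pass the file name to XmlWriter.Create as in the comment; the writer then uses UTF-8 by default, which matters once the declaration that named the encoding is gone.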
A: 

xmllint can do it and, since xmllint is written in C and is free software, it is probably relatively easy to adapt the way it does it to your C# program. Here is an example:

% cat foo.xhtml 
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html [
    <!ENTITY D "&#x2014;">
    <!ENTITY o "&#x2018;">
    <!ENTITY c "&#x2019;">
    <!ENTITY O "&#x201C;">
    <!ENTITY C "&#x201D;">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    </head>
    <body>
        &O; &C;
    </body>
</html>

% xmllint --noent --dropdtd foo.xhtml
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    </head>
    <body>
        [Plain Unicode characters that I prefer to omit because I don't know how SO handles it]
    </body>
</html>
bortzmeyer