views:

2000

answers:

7

When loading XML into an XmlDocument, i.e.

XmlDocument document = new XmlDocument();
document.LoadXml(xmlData);

is there any way to stop the process from replacing character references? I've got a strange problem where a TM symbol (stored as the character reference &#8482;) in the XML is being converted into the TM character itself. As far as I'm concerned this shouldn't happen, as the XML document has the encoding ISO-8859-1 (which doesn't have the TM symbol).

Thanks

+2  A: 

I confess things get a little confusing with XML documents and encodings, but I'd hope that the reference would be written out appropriately when you save the document again if you're still using ISO-8859-1, and that if you save with UTF-8 it wouldn't need to be. In some ways, logically the document really contains the symbol rather than the entity reference - the latter is just an encoding matter. (I'm thinking aloud here - please don't take this as authoritative information.)

What are you doing with the document after loading it?

Jon Skeet
Eventually I'm outputting the character to a webpage. The problem is that the character is broken on display because I've set the responseEncoding to be ISO-8859-1.
Gordon Carpenter-Thompson
How are you writing the data to the web page though? If you write it out using a TextWriter with an encoding of ISO-8859-1 I would expect it to put the right character entity in. (Do you really have to use ISO-8859-1 in the first place, btw?)
Jon Skeet
I'm storing it as a string in a DTO. This is retrieved from the XML by looking for the specific node and then doing string fieldValue = ((XmlNode)fieldListEnum.Current).FirstChild.Value. I eventually write it out to a Repeater using some databinding code
Gordon Carpenter-Thompson
What I don't understand, however, is why it's not working correctly if the data is stored in the XML in an encoding-agnostic way.
Gordon Carpenter-Thompson
So you've got the Unicode character in FirstChild.Value - it's been decoded from the character entity. It sounds like it's not the XML document which you need to look at, but the repeater. I suggest you ignore the XML for the moment and try to write the character (hard-coded) out to the repeater.
Jon Skeet
thanks for the help Jon
Gordon Carpenter-Thompson
A: 

I believe if you enclose the entity contents in a CDATA section it should leave it all alone, e.g.

    <root>
      <testnode>
        <![CDATA[some text &#8482;]]>
      </testnode>
    </root>
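
One caveat worth checking before relying on this: character references are not recognised inside a CDATA section, so the parser hands back the literal text `&#8482;` rather than the TM character. A minimal sketch of the difference follows; the class and method names are made up for illustration and the whitespace around the CDATA section is dropped to keep the comparison simple.

```csharp
using System;
using System.Xml;

static class CDataDemo
{
    // Hypothetical helper: parses the XML and returns the text of /root/testnode.
    public static string ReadTestNode(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.SelectSingleNode("/root/testnode").InnerText;
    }

    static void Main()
    {
        string inCData = ReadTestNode("<root><testnode><![CDATA[some text &#8482;]]></testnode></root>");
        string plain = ReadTestNode("<root><testnode>some text &#8482;</testnode></root>");

        // Inside CDATA the reference stays as eight literal characters.
        Console.WriteLine(inCData == "some text &#8482;"); // True
        // Outside CDATA the parser decodes it to U+2122.
        Console.WriteLine(plain == "some text \u2122");    // True
    }
}
```

So the text survives untouched, but whatever renders it later would have to treat `&#8482;` as markup rather than escape it again.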
Andy
+2  A: 

What are you writing it to? A TextWriter? a Stream? what?

The following keeps the entity (well, it replaces it with the hex equivalent) - but if you do the same with a StringWriter it detects that the target is Unicode and writes the character itself instead:

    // Requires: using System; using System.IO; using System.Text; using System.Xml;
    XmlDocument doc = new XmlDocument();
    doc.LoadXml(@"<xml>&#8482;</xml>");
    Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
    using (MemoryStream ms = new MemoryStream())
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Encoding = latin1;
        using (XmlWriter xw = XmlWriter.Create(ms, settings))
        {
            doc.Save(xw);
        }
        // Decode the bytes with the same encoding the writer used.
        Console.WriteLine(latin1.GetString(ms.ToArray()));
    }

Outputs:

    <?xml version="1.0" encoding="iso-8859-1"?><xml>&#x2122;</xml>
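
For comparison, here is a rough sketch of the StringWriter variant mentioned above. The helper name `SaveToString` is invented for this example. Because a StringWriter is UTF-16, the writer has no need for a character reference and should emit the TM character directly.

```csharp
using System;
using System.IO;
using System.Xml;

static class StringWriterDemo
{
    // Hypothetical helper: serialises the document through a StringWriter.
    public static string SaveToString(XmlDocument doc)
    {
        using (StringWriter sw = new StringWriter())
        {
            using (XmlWriter xw = XmlWriter.Create(sw))
            {
                doc.Save(xw);
            }
            return sw.ToString();
        }
    }

    static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<xml>&#8482;</xml>");
        string text = SaveToString(doc);

        Console.WriteLine(text.Contains("\u2122")); // True: the raw character is written
        Console.WriteLine(text.Contains("&#"));     // False: no character reference needed
    }
}
```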
Marc Gravell
A: 

Character references (such as &#8482;) are not encoding specific. According to the W3C XML 1.0 Recommendation:

If the character reference begins with "&#x", the digits and letters up to the terminating ; provide a hexadecimal representation of the character's code point in ISO/IEC 10646.

csgero
Maybe not when reading - but they are when writing, since some code-points may not exist in that encoding, thus needing the character reference; so it really comes down to how the OP is *writing* data.
Marc Gravell
+2  A: 

This is a standard misunderstanding of the XML toolset. The whole business with "&#x" is a syntactic feature designed to cope with character encodings. Your XmlDocument isn't a stream of characters - it has been freed of character encoding issues - instead it contains an abstract model of XML type data. Words for this include DOM and InfoSet; I'm not sure exactly which is accurate.

The "&#x" gubbins won't exist in this model because the whole issue is irrelevant there. It will return - if appropriate - when you transform the Info Set back into a character stream in some specific encoding.
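
A small sketch of what that means in practice, with class and method names chosen just for this example: once the document is loaded, the decimal and hexadecimal forms of the reference are indistinguishable, and the node simply holds a single U+2122 character.

```csharp
using System;
using System.Xml;

static class CharacterReferenceDemo
{
    // Hypothetical helper: returns the text content of the root element after parsing.
    public static string LoadText(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerText;
    }

    static void Main()
    {
        string fromDecimal = LoadText("<xml>&#8482;</xml>");
        string fromHex = LoadText("<xml>&#x2122;</xml>");

        Console.WriteLine(fromDecimal.Length);        // 1
        Console.WriteLine(fromDecimal == "\u2122");   // True
        Console.WriteLine(fromDecimal == fromHex);    // True
    }
}
```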

This misunderstanding is sufficiently common to have made it into the academic literature as part of a collection of similar quirks. Take a look at "XML Fever" at this location: http://doi.acm.org/10.1145/1364782.1364795

Simon Gibbs
A: 
AnthonyWJones
A: 

Thanks for all of the help.

I've fixed my problem by writing an HtmlEncode function which actually replaces all of the characters before it spits them out to the webpage (instead of relying on the somewhat broken .NET HtmlEncode() function, which only seems to encode a small subset of the characters necessary).
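
The function itself isn't shown here, so the following is only a guess at the general shape of such a helper, not the actual code. The name `EncodeNonAscii` and the choice to escape everything above ASCII as a decimal character reference are assumptions made for this sketch.

```csharp
using System;
using System.Globalization;
using System.Text;

static class HtmlText
{
    // Hypothetical helper: escapes HTML-significant characters and turns every
    // non-ASCII character into a decimal character reference such as &#8482;.
    public static string EncodeNonAscii(string input)
    {
        StringBuilder sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            switch (c)
            {
                case '&': sb.Append("&amp;"); break;
                case '<': sb.Append("&lt;"); break;
                case '>': sb.Append("&gt;"); break;
                case '"': sb.Append("&quot;"); break;
                default:
                    if (c < 128)
                    {
                        sb.Append(c);
                    }
                    else
                    {
                        // Combine a surrogate pair into one code point and skip its second half.
                        int codePoint = char.IsSurrogatePair(input, i)
                            ? char.ConvertToUtf32(input, i++)
                            : c;
                        sb.Append("&#")
                          .Append(codePoint.ToString(CultureInfo.InvariantCulture))
                          .Append(';');
                    }
                    break;
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(EncodeNonAscii("Brand\u2122 <b>")); // Brand&#8482; &lt;b&gt;
    }
}
```

Output produced this way is pure ASCII, so it displays the same whether the response encoding is ISO-8859-1 or UTF-8.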

Gordon Carpenter-Thompson