I'm working with XML data from an application where we get XML like this:

<elt attrib="Swedish: &#228; &#246; Euro: &#128; Quotes: &#145; &#146; &#147; &#148;">
Swedish: &#228; &#246; Euro: &#128; Quotes: &#145; &#146; &#147; &#148;
</elt>

I want the attribute value and inner text values to be

Swedish: ä ö Euro: € Quotes: ‘ ’ “ ”

but code like this:

Dim sXml As String = "<?xml version = ""1.0"" encoding = ""Windows-1252""?>" & vbCrLf & _
  "<elt attrib=""Swedish: &#228; &#246; Euro: &#128; Quotes: &#145; &#146; &#147; &#148;"">" & _
  "Swedish: &#228; &#246; Euro: &#128; Quotes: &#145; &#146; &#147; &#148;" & _
  "</elt>"

Dim X As New XmlDocument
X.LoadXml(sXml)

TextBox1.Text = "Attribute: {" & X.DocumentElement.Attributes("attrib").Value & "}" & _
  vbCrLf & "InnerText: {" & X.DocumentElement.InnerText & "}" & vbCrLf & _
  "Length: " & Convert.ToString(Len(X.DocumentElement.InnerText))

or this:

Dim X As XDocument = XDocument.Parse(sXml)

TextBox1.Text = "Attribute: {" & X.Root.Attribute("attrib").Value & "}" & _
  vbCrLf & "InnerText: {" & X.Root.Value & "}" & vbCrLf & _
  "Length: " & Convert.ToString(Len(X.Root.Value))

give me:

{Swedish: ä ö Euro: € Quotes: ‘ ’ “ ”}

They both have the length correct at 36, so apparently where I want the Euro and quotes I'm getting something else, presumably based on a Unicode encoding.

A: 

Please don't ever manipulate XML via the String type. It will very often mess things up.

Your test examples are not using the real data file, are they? Be sure to test what you're going to use. You have no idea how the tests differ from reality. You need to take one of the files you'll be processing, and use XDocument.Load to read it in.

After that, go take a look at the attribute values, character by character.


I tried the following, and it worked:

using (var reader = XmlReader.Create(@"..\..\..\..\Swedish.xml"))
{
    var sw = XDocument.Load(reader);
    var element = sw.Element("elt");
    if (element != null)
    {
        var attribute = element.Attribute("attrib");
        if (attribute != null)
        {
            var v = attribute.Value;
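            // Walk the whole value instead of assuming it is 36 characters long
            for (var i = 0; i < v.Length; i++)
            {
                var c = v[i];

                // Index, numeric code of the character, and the character itself
                Console.WriteLine("v[{0}]={1} \t('{2}')", i, (int)c, c);
            }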

            Console.WriteLine();
        }
    }
}

The output was:

v[0]=83         ('S')
v[1]=119        ('w')
v[2]=101        ('e')
v[3]=100        ('d')
v[4]=105        ('i')
v[5]=115        ('s')
v[6]=104        ('h')
v[7]=58         (':')
v[8]=32         (' ')
v[9]=228        ('ä')
v[10]=32        (' ')
v[11]=246       ('ö')
v[12]=32        (' ')
v[13]=69        ('E')
v[14]=117       ('u')
v[15]=114       ('r')
v[16]=111       ('o')
v[17]=58        (':')
v[18]=32        (' ')
v[19]=128       ('?')
v[20]=32        (' ')
v[21]=81        ('Q')
v[22]=117       ('u')
v[23]=111       ('o')
v[24]=116       ('t')
v[25]=101       ('e')
v[26]=115       ('s')
v[27]=58        (':')
v[28]=32        (' ')
v[29]=145       ('?')
v[30]=32        (' ')
v[31]=146       ('?')
v[32]=32        (' ')
v[33]=147       ('?')
v[34]=32        (' ')
v[35]=148       ('?')

I presume the question marks are due to whatever my console was set to, but you can see that the numeric values are correct.

John Saunders
A: 

First of all, numeric character references are interpreted the same way regardless of the encoding of the input file. XML is defined strictly in terms of Unicode (any other encoding is mapped onto Unicode first), and numeric character references represent Unicode code points.
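As a small illustration of that point, here is a sketch (the class and method names are made up for this example) that parses a document declaring Windows-1252 and looks at the code point a numeric reference produces. It assumes, as in the question, that XDocument.Parse accepts the declaration on a string input.

```csharp
using System;
using System.Xml.Linq;

public static class CharRefDemo
{
    // Hypothetical helper: returns the code point of the first character
    // in the root element's text.
    public static int CodePointOf(string xml)
    {
        string value = XDocument.Parse(xml).Root.Value;
        return char.ConvertToUtf32(value, 0);
    }

    public static void Main()
    {
        const string decl = "<?xml version=\"1.0\" encoding=\"Windows-1252\"?>";

        // &#128; means U+0080 (a C1 control character), not byte 0x80 of Windows-1252
        Console.WriteLine(CodePointOf(decl + "<e>&#128;</e>"));   // prints 128

        // The euro sign has to be referenced by its Unicode code point, U+20AC
        Console.WriteLine(CodePointOf(decl + "<e>&#8364;</e>"));  // prints 8364
    }
}
```

So a producer that wants a euro sign must emit &#8364; (or &#x20AC;), whatever encoding the file declares.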

Because of that, your XML, when treated as XML, has precisely the semantic meaning that you got out of it using XmlDocument, and no other. If you want a different result, you are really trying to parse it as not-quite-XML, which is something no .NET XML API will let you do, not even XmlReader, because the way references are expanded isn't meant to be customizable.

The closest you can come to that is to first preprocess the input "XML" as text, replacing those numeric character references with the correct Unicode code points, for example using Regex. This can be tricky, however, because doing so for arbitrary input XML requires you to tell where the expansion should not take place (e.g. inside CDATA sections).
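A minimal sketch of such a preprocessing step might look like the following. The class name and the lookup table are my own, not part of any .NET API, and the code assumes the input contains no CDATA sections where a reference-looking string must be left alone. Instead of inserting literal characters, this variation rewrites each reference in the range 128 to 159 into a reference to the character Windows-1252 assigns to that byte, so the text stays valid XML whatever encoding it is later saved in.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class Cp1252ReferenceFixer
{
    // Windows-1252 assignments for bytes 0x80-0x9F; 0 marks the five unassigned bytes
    private static readonly int[] Cp1252High =
    {
        0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
        0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
    };

    // Matches decimal (&#128;) and hexadecimal (&#x80;) character references
    private static readonly Regex NumericRef =
        new Regex(@"&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    public static string Fix(string xmlText)
    {
        return NumericRef.Replace(xmlText, m =>
        {
            bool isHex = m.Groups[1].Success;
            string digits = isHex ? m.Groups[1].Value : m.Groups[2].Value;

            int code;
            bool parsed = int.TryParse(
                digits,
                isHex ? NumberStyles.AllowHexSpecifier : NumberStyles.None,
                CultureInfo.InvariantCulture,
                out code);

            // Leave everything outside the C1 range (and anything unparsable) alone
            if (!parsed || code < 0x80 || code > 0x9F) return m.Value;

            int mapped = Cp1252High[code - 0x80];
            return mapped == 0 ? m.Value : "&#x" + mapped.ToString("X4") + ";";
        });
    }

    public static void Main()
    {
        Console.WriteLine(Fix("<elt attrib=\"&#128; &#145;\">&#228; &#148;</elt>"));
        // prints: <elt attrib="&#x20AC; &#x2018;">&#228; &#x201D;</elt>
    }
}
```

The fixed text can then be handed to XmlDocument.LoadXml or XDocument.Parse as usual. References such as &#228; are already correct Unicode and pass through untouched.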

Pavel Minaev
Guess they are outputting what you term "not quite XML". In our case Regex may work OK, because there are no CDATA blocks. Any other likely trouble spots come to mind along the lines of CDATA? Thanks for the explanation and suggestion.
Marc S
Technically, character and entity references aren't expanded in comments and PIs either, but I'm pretty sure you won't care about the former, and I find it very unlikely that you'd care about the latter.
Pavel Minaev
Right, neither comments nor PI's occur in the data we're getting. Thanks, this is very helpful.
Marc S