I am trying to read an XML stream and load it into a collection.

This works, but I'm having difficulties reading special characters.

E.g. if my XML looks like this:

<?xml version="1.0" encoding="ISO-8859-1" ?> 
<persons>
<person>
 <firstname>
 <![CDATA[ Sébastien ]]> 
  </firstname>
  <lastname>
   <![CDATA[Ørvåk]]> 
  </lastname>
</person>
</persons>

I try to read the values using LINQ like this:

var persons = from p in doc.Elements("persons").Elements("person") select p;
foreach (var person in persons)
{
    string firstname = person.Element("firstname").Value;
    string lastname = person.Element("lastname").Value;
}

but the Ø and å in Ørvåk and the é in Sébastien come out as strange characters.

Does anyone know what's wrong? I guess it doesn't use the encoding ISO-8859-1.

Thanks

+1  A: 

It's possible the file isn't in ISO-8859-1 but is in UTF-8. Can you provide a hex dump of the contents? Sometimes the writer of an XML file isn't careful about the encoding string.

Also, it could be that the XML document comes via HTTP, and the HTTP headers declare the encoding improperly. Section 4.3.3 in the XML specification states that MIME rules override what the document itself states.

If you point your own code at the link instead of your local copy, it could mean your local web server isn't configured properly...

lavinio
A: 

The XML file you mentioned in your follow-up is perfectly correct. So, your bug is specific to your Javascript code.

bortzmeyer
Javascript code? What do you mean? I don't have any JS code.
OK, you have detected my lack of familiarity with Javascript. Then, what is it? C#? You did not tag the question with the language you use.
bortzmeyer
+2  A: 

To expand on an answer someone else gave:

There are two possibilities:

  1. The file is really encoded as UTF-8, but is being interpreted by your xml parser as ISO-8859-1.
  2. The file is really encoded as ISO-8859-1 but is being interpreted by your xml parser as UTF-8.

To determine which is which, look at what happens with the é in Sébastien. There are two possibilities I can imagine:

  1. "é" becomes two different characters - probably "Ã©"
  2. "é" becomes a single nonsense character or "?", and possibly the "b" is also missing from the name Sébastien.
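
Both symptoms can be reproduced without any XML at all, which may help in recognising which one you are seeing. The following is only a minimal sketch: the class and method names (MojibakeDemo, Utf8ReadAsLatin1, Latin1ReadAsUtf8) are made up for illustration, and the non-ASCII characters are written as \u escapes so that the result does not depend on how the source file itself is saved.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the names are illustrative, not from the question.
public static class MojibakeDemo
{
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Symptom 1: UTF-8 bytes decoded as ISO-8859-1.
    // Each non-ASCII character turns into two characters.
    public static string Utf8ReadAsLatin1(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return Latin1.GetString(bytes);
    }

    // Symptom 2: ISO-8859-1 bytes decoded as UTF-8.
    // The lone byte 0xE9 is not valid UTF-8, so the decoder substitutes
    // the replacement character U+FFFD.
    public static string Latin1ReadAsUtf8(string text)
    {
        byte[] bytes = Latin1.GetBytes(text);
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        string name = "S\u00E9bastien"; // Sébastien

        // How these display depends on the console's own encoding.
        Console.WriteLine(Utf8ReadAsLatin1(name));
        Console.WriteLine(Latin1ReadAsUtf8(name));
    }
}
```

If the first pattern matches what your program shows, the data is arriving as UTF-8; if the second matches, it is arriving as ISO-8859-1 and being decoded as UTF-8.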

In the first case, your file is not what you think it is. (It is getting to your program as UTF-8 data, but your program is trying to interpret it as ISO-8859-1) Look at the xml file with a hex editor or something else that can show you what the bytes on the disk are.

In the second case, I'd check how the HTTP server on localhost is serving this file. (Your program is getting bytes in ISO-8859-1 format, but is interpreting them as UTF-8.) The easiest way to do that on Windows is to open a command prompt and run the command: telnet localhost 80

When that pops up a window, type the following line (or cut-and-paste from stackoverflow) and press enter twice. Warning: You won't be able to see what you're typing, and capitalization is important.

GET /Test/person.xml HTTP/1.0

In the response, look for a line beginning with Content-Type. That will tell you how the webserver locally is serving up the file.

Update: Having looked at your file, it really is ISO-8859-1, so what I would suggest is setting the Encoding property of your WebClient instance like so before you tell it to download the file:

client.Encoding = System.Text.Encoding.GetEncoding("iso-8859-1");

Alternatively, you could use the DownloadData methods instead of the DownloadString methods, and then have the XML parser read the bytes directly. The problem currently is that by the time the XML parser gets the file contents, the bytes have already been decoded into a string, so it's too late to change the encoding there.
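
A rough sketch of the bytes-first approach, under some assumptions: the class and method names (ParseBytesDemo, ParseXmlBytes) are hypothetical, and the byte array is built in memory as a stand-in for whatever the download call returns, so the example runs without a web server. Handing the parser a stream of raw bytes lets it pick up the encoding from the XML declaration itself.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

// Hypothetical demo class; the names are illustrative, not from the question.
public static class ParseBytesDemo
{
    // Give the parser the undecoded bytes so it can honour the
    // encoding named in the <?xml ... ?> declaration.
    public static XDocument ParseXmlBytes(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        {
            return XDocument.Load(stream);
        }
    }

    public static void Main()
    {
        // Stand-in for downloaded bytes: the question's document,
        // encoded as ISO-8859-1 (non-ASCII letters as \u escapes).
        string xml =
            "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>" +
            "<persons><person>" +
            "<firstname><![CDATA[S\u00E9bastien]]></firstname>" +
            "<lastname><![CDATA[\u00D8rv\u00E5k]]></lastname>" +
            "</person></persons>";
        byte[] data = Encoding.GetEncoding("iso-8859-1").GetBytes(xml);

        XDocument doc = ParseXmlBytes(data);
        foreach (XElement person in doc.Elements("persons").Elements("person"))
        {
            Console.WriteLine(person.Element("firstname").Value);
            Console.WriteLine(person.Element("lastname").Value);
        }
    }
}
```

In real code the data array would come from the download rather than from a literal. Note that this relies on the declaration in the file being correct, which appears to be the case here.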

Daniel Martin