ansaurus

Question

problems reading CDATA section with special chars (ISO-8859-1 encoding)

Answer 1

+1 A:

It's possible the file isn't actually in ISO-8859-1 but in UTF-8. Can you provide a hex dump of the contents? The writer of an XML file isn't always careful that the encoding declaration matches the bytes actually written.
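If no hex editor is at hand, a few lines of script can produce the dump. This is a minimal sketch in Python; the function name hex_dump and the sample strings are made up for illustration and are not from the question.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Return a classic offset / hex / ASCII dump of the given bytes."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  {text_part}")
    return "\n".join(lines)


# Hypothetical sample: the same name encoded two different ways.
latin1_bytes = "Sébastien".encode("iso-8859-1")
utf8_bytes = "Sébastien".encode("utf-8")

print(hex_dump(latin1_bytes))  # é appears as the single byte e9
print(hex_dump(utf8_bytes))    # é appears as the two bytes c3 a9
```

For a real file, read it in binary mode first (for example open(path, "rb").read()) so that nothing decodes the bytes before you look at them.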

Also, it could be that the XML document arrives over HTTP and the HTTP headers declare the encoding incorrectly. Section 4.3.3 of the XML specification states that the encoding given by MIME rules overrides what the document itself declares.
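The precedence described here can be sketched roughly as follows. This is a simplified illustration in Python, not how any particular parser implements it: the function names are hypothetical, byte order marks are ignored, and the XML declaration is assumed to be readable as ASCII.

```python
import re
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lower-cased, or None if absent


def charset_from_xml_declaration(xml_bytes: bytes) -> Optional[str]:
    """Read the encoding from an ASCII-compatible XML declaration, if any."""
    match = re.match(
        rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', xml_bytes
    )
    return match.group(1).decode("ascii").lower() if match else None


def effective_encoding(content_type: Optional[str], xml_bytes: bytes) -> str:
    """Prefer the transport-level charset, then the declaration, then UTF-8."""
    if content_type:
        charset = charset_from_content_type(content_type)
        if charset:
            return charset
    return charset_from_xml_declaration(xml_bytes) or "utf-8"


doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><person/>'
print(effective_encoding("text/xml; charset=UTF-8", doc))  # header wins
print(effective_encoding("text/xml", doc))                 # declaration wins
```

With a header of "text/xml; charset=UTF-8" the ISO-8859-1 declaration in the file is overridden, which is exactly the kind of mismatch that would corrupt the é.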

Try pointing your own code at the link instead of your local copy. If that behaves differently, it could mean your local web server isn't configured properly...

lavinio 2009-06-15 17:38:31

Answer 2

A:

The XML file you mentioned in your follow-up is perfectly correct. So your bug is specific to your JavaScript code.

bortzmeyer 2009-06-16 09:24:51

JavaScript code? What do you mean? I don't have any JS code.

2009-06-16 11:57:06

OK, you have detected my lack of familiarity with JavaScript. Then what is it? C#? You did not tag the question with the language you use.

bortzmeyer 2009-06-16 12:28:06

Answer 3

+2 A:

To expand on an answer someone else gave:

There are two possibilities:

The file is really encoded as UTF-8, but is being interpreted by your XML parser as ISO-8859-1.
The file is really encoded as ISO-8859-1, but is being interpreted by your XML parser as UTF-8.
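To see what each mismatch does to text, both can be reproduced in a few lines. This is only an illustration in Python with a sample string; the variable names are made up, and your own parser may react differently to invalid bytes.

```python
name = "Sébastien"

# Case 1: UTF-8 bytes misread as ISO-8859-1.
# The two UTF-8 bytes of "é" (c3 a9) each become one character.
utf8_read_as_latin1 = name.encode("utf-8").decode("iso-8859-1")

# Case 2: ISO-8859-1 bytes misread as UTF-8.
# The lone byte e9 is not valid UTF-8 on its own, so a lenient decoder
# substitutes a replacement character (a strict decoder raises an error).
latin1_read_as_utf8 = name.encode("iso-8859-1").decode("utf-8", errors="replace")

print(utf8_read_as_latin1)
print(latin1_read_as_utf8)
```

In the second case some decoders also consume the byte that follows the invalid one, which would explain a letter going missing after the damaged character.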

To determine which is which, look at what happens to the é in Sébastien. There are two possibilities I can imagine:

"é" becomes two nonsense characters, such as "Ã©".
"é" becomes a single nonsense character or "?", and possibly the "b" is also missing from the name Sébastien.

In the first case, your file is not what you think it is: it is reaching your program as UTF-8 data, but your program is trying to interpret it as ISO-8859-1. Look at the XML file with a hex editor or something else that can show you what the bytes on disk are.
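A quick programmatic check of the bytes on disk is also possible. The sketch below (Python, with a hypothetical function name) only distinguishes "valid UTF-8" from "not valid UTF-8"; it cannot prove that a file is ISO-8859-1, because almost any byte sequence is valid in that encoding.

```python
def guess_encoding(data: bytes) -> str:
    """Rough guess: 'ascii', 'utf-8', or 'not utf-8 (possibly iso-8859-1)'."""
    if all(b < 0x80 for b in data):
        # Pure ASCII reads the same in both encodings, so nothing can go wrong.
        return "ascii"
    try:
        data.decode("utf-8")  # strict decoding: raises on invalid sequences
    except UnicodeDecodeError:
        return "not utf-8 (possibly iso-8859-1)"
    return "utf-8"


print(guess_encoding("Sébastien".encode("utf-8")))
print(guess_encoding("Sébastien".encode("iso-8859-1")))
```

To apply it to the file itself, pass in the result of reading the file in binary mode.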

In the second case, I'd check how the HTTP server on localhost is serving this file: your program is getting bytes in ISO-8859-1 format, but is interpreting them as UTF-8. The easiest way to do that on Windows is to open a command prompt and run the command: telnet localhost 80

When that pops up a window, type the following line (or copy and paste it from Stack Overflow) and press Enter twice. Warning: you won't be able to see what you're typing, and capitalization is important.

GET /Test/person.xml HTTP/1.0

In the response, look for a line beginning with Content-Type. That will tell you how your local web server is serving the file.
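If telnet isn't available, the same request can be sent from a short script. This is a sketch in Python under the same assumptions as the telnet session (server on localhost port 80, file at /Test/person.xml); the function names are made up for illustration.

```python
import socket
from typing import Optional


def fetch_raw_response(host: str, port: int, path: str) -> bytes:
    """Send a bare HTTP/1.0 GET, as the telnet session would, and return the raw reply."""
    request = f"GET {path} HTTP/1.0\r\n\r\n".encode("ascii")
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # the server closes the connection when it is done
                break
            chunks.append(chunk)
    return b"".join(chunks)


def content_type_of(raw_response: bytes) -> Optional[str]:
    """Return the Content-Type header value from a raw HTTP response, if present."""
    head = raw_response.split(b"\r\n\r\n", 1)[0]
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-type":
            return value.strip().decode("iso-8859-1")
    return None


# Example call, left commented out because it needs the local server running:
# print(content_type_of(fetch_raw_response("localhost", 80, "/Test/person.xml")))
```

A result such as "text/xml; charset=utf-8" for a file that is really ISO-8859-1 would point to the server configuration as the cause.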

Update: Having looked at your file, it really is ISO-8859-1, so what I would suggest is setting the Encoding property of your WebClient instance like so before you tell it to download the file:

client.Encoding = System.Text.Encoding.GetEncoding("iso-8859-1")

Alternatively, you could use the DownloadData method instead of the DownloadString method, and then have the XML parser read the raw bytes. The problem currently is that by the time the XML parser gets the file contents, the bytes have already been interpreted as a string, so it's too late to change the encoding there.
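The difference between the two approaches is not specific to .NET. As a language-neutral illustration, here is the same idea in Python with a made-up stand-in for the downloaded bytes; it is a sketch of the principle, not of the WebClient API.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the downloaded bytes of the file.
raw = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<person><name><![CDATA[Sébastien]]></name></person>"
).encode("iso-8859-1")

# Option 1: decode explicitly with the known encoding before using the text.
text = raw.decode("iso-8859-1")

# Option 2: hand the undecoded bytes to the parser, which reads the
# encoding from the XML declaration itself.
root = ET.fromstring(raw)
name = root.findtext("name")

# What goes wrong when the bytes are turned into a string with the wrong
# encoding first: the damage is done before any parser sees the data.
damaged = raw.decode("utf-8", errors="replace")

print(name)
print("Sébastien" in damaged)
```

Option 2 is usually the more robust choice, since the parser honors whatever encoding the document declares rather than relying on a value hard-coded by the caller.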

Daniel Martin 2009-06-16 22:25:30
