I have an MS Access database, used from ASP, that contains strings in various European languages. The database was populated earlier by agents in the respective countries. As you would expect, it contains entries with accented and other special characters. If I open the database with MS Access these characters show up fine. For example, the German equivalent of "Open" shows as "Öffnen" (hopefully you can see an "O" with 2 dots above it!).

I have ASP code that reads the database and returns records as XML. The text is passed to XMLEncode to construct the XML, but that only seems to deal with the 5 special characters such as "<" and "&". If I dump the XML, the accented characters are unchanged:

<English>Open</English>
<German>Öffnen</German> 

If I look at the raw packets with Wireshark I see that the "Ö" is sent as the single byte hex D6, which appears to be both its Unicode code point (U+00D6) and its ISO 8859-1 value.

The problem starts when I try to parse the XML in client-side JS. I get:

"An invalid character was found in text content"

from IE. Firefox and Chrome happily accept the XML without a hiccup, but they display the "Ö" character as a diamond with a question mark inside.

http://www.validome.org/xml/validate/ reports "encoding error."

http://www.w3schools.com/dom/dom_validate.asp thinks it is fine.

The XML is UTF-8 encoded.

What do I need to do to have IE accept my XML without complaint?

What do I need to do to have browsers display the stuff correctly?

A:

How do you know the XML is UTF-8 encoded? I don't know the MS environment well, but in Java a common problem is to assume that just writing the encoding="UTF-8" header causes it to be UTF-8 encoded. You also have to configure the writer to actually write UTF-8.

You said Wireshark shows hex D6, which would indicate the stream is actually NOT UTF-8 encoded, regardless of what the header says.
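A quick way to check this is to feed the captured bytes to a strict UTF-8 decoder. The following is only a minimal sketch, assuming a runtime with the standard `TextDecoder` (modern browsers or Node.js); `isValidUtf8` and the two byte arrays are hypothetical names made up for illustration, with the bytes taken from the Wireshark observations in this thread.

```javascript
// Hypothetical helper: returns true if the bytes are well-formed UTF-8.
// With { fatal: true } the decoder throws instead of substituting U+FFFD.
function isValidUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(Uint8Array.from(bytes));
    return true;
  } catch (e) {
    return false;
  }
}

// "Öffnen" with Ö sent as the single ISO 8859-1 byte 0xD6.
const latin1Bytes = [0xD6, 0x66, 0x66, 0x6E, 0x65, 0x6E];
// "Öffnen" with Ö sent as the UTF-8 pair 0xC3 0x96.
const utf8Bytes = [0xC3, 0x96, 0x66, 0x66, 0x6E, 0x65, 0x6E];

console.log(isValidUtf8(latin1Bytes)); // false
console.log(isValidUtf8(utf8Bytes));   // true

// A lenient decoder (the default) replaces the bad byte with U+FFFD,
// the replacement character usually drawn as a diamond with a question mark.
const lenient = new TextDecoder("utf-8").decode(Uint8Array.from(latin1Bytes));
console.log(lenient === "\uFFFDffnen"); // true
```

The lone 0xD6 looks like the lead byte of a two-byte sequence, but the next byte (0x66, "f") is not a continuation byte, so a strict parser rejects the stream while a lenient one substitutes the replacement character.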

Jim Garrison
A: 

Well, I'm not entirely sure why, but I was able to get it working. Prompted by Jim's comments I changed the XML and response encoding back from 8859-1 to UTF-8, and also the encoding in the META tag for the pages.

It now works without complaint in IE, and the browsers now display the correct characters.

I also checked the raw bytes with Wireshark this time and the "Ö" character is being encoded in the XML as 2 bytes (0xC3, 0x96), instead of 1 byte of 0xD6.
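For anyone wondering where those 2 bytes come from, here is a rough sketch of the UTF-8 rule for code points between U+0080 and U+07FF: the top 5 bits go into a lead byte of the form 110xxxxx and the low 6 bits into a continuation byte of the form 10xxxxxx. `encodeTwoByteUtf8` is a hypothetical name used only for this illustration, and the cross-check assumes the standard `TextEncoder` is available.

```javascript
// Hypothetical helper: encode one code point in the range U+0080..U+07FF
// as its two UTF-8 bytes.
function encodeTwoByteUtf8(codePoint) {
  if (codePoint < 0x80 || codePoint > 0x7FF) {
    throw new RangeError("code point is outside the two-byte UTF-8 range");
  }
  const lead = 0xC0 | (codePoint >> 6);           // 110xxxxx: top 5 bits
  const continuation = 0x80 | (codePoint & 0x3F); // 10xxxxxx: low 6 bits
  return [lead, continuation];
}

const manual = encodeTwoByteUtf8(0xD6); // U+00D6 is "Ö"
console.log(manual.map(b => "0x" + b.toString(16).toUpperCase()).join(", "));
// prints "0xC3, 0x96"

// Cross-check against the built-in encoder.
const builtIn = Array.from(new TextEncoder().encode("\u00D6"));
console.log(builtIn[0] === manual[0] && builtIn[1] === manual[1]); // true
```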

So in summary:

In the server-side ASP code that generates the XML declaration at the top of the response:

return ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>") ;

In the server-side ASP code to generate the response itself:

Response.ContentType = "text/xml; charset=UTF-8" ;
Response.Write (XMLResponse) ;

and in the web page header:

<head>
  <meta http-equiv="Content-type" content="text/html; charset=UTF-8"> 

Many thanks Jim.