I have a problem with classic ASP / VBScript trying to read a UTF-8 encoded XML file with MSXML. The file is encoded correctly; I can see that with all other tools.

Constructed XML example:

<?xml version="1.0" encoding="UTF-8"?>
<itshop>
    <Product Name="Backup gewünscht" />
</itshop>

If I try to do this in ASP...

Set fso = Server.CreateObject("Scripting.FileSystemObject")
Set ts = fso.OpenTextFile("input.xml", FOR_READING)
XML = ts.ReadAll
ts.Close
Set ts = nothing
Set fso = Nothing

Set myXML = Server.CreateObject("Msxml2.DOMDocument.4.0")
myXML.loadXML(XML)
Set DocElement = myXML.documentElement
Set ProductNodes = DocElement.selectNodes("//Product")
Response.Write ProductNodes(0).getAttribute("Name")
' ...

... and Name contains special characters (German umlauts, to be specific), the bytes of the umlaut's two-byte code get re-encoded, so I end up with two totally crappy nonsense characters. What should be "ü" becomes "Ã¼", which is FOUR bytes in my output, not two (correct UTF-8) or one (ISO-8859-#).

What am I doing wrong? Why is MSXML thinking that the input is ISO-8859-# so that it tries to convert it to UTF-8?

+4  A: 
Set ts = fso.OpenTextFile("input.xml", FOR_READING, False, True)

The last parameter is the "Unicode" flag.

OpenTextFile() has the following signature:

object.OpenTextFile(filename[, iomode[, create[, format]]])

where "format" is defined as

Optional. One of three Tristate values used to indicate the format of the opened file. If omitted, the file is opened as ASCII.

And Tristate is defined as:

TristateUseDefault  -2   Opens the file using the system default.
TristateTrue        -1   Opens the file as Unicode.
TristateFalse        0   Opens the file as ASCII.

And -1 happens to be the numerical value of True.

Anyway, better is:

Set myXML = Server.CreateObject("Msxml2.DOMDocument.4.0")
myXML.load("input.xml")

Why should you use a TextStream object to read in a file that MSXML can read perfectly on its own?

The TextStream object also has no notion of the actual file encoding. The docs say "Unicode", but there is more than one way of encoding Unicode. The load() method of the MSXML object will be able to deal with all of them.
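
A slightly fuller sketch of the load() variant, with a check for parse errors. The use of Server.MapPath assumes that input.xml sits next to the ASP page, and the error output is only illustrative:

```vbscript
' Sketch only: assumes input.xml is in the same folder as this ASP page.
Dim myXML, ProductNodes
Set myXML = Server.CreateObject("Msxml2.DOMDocument.4.0")
myXML.async = False   ' wait until the document is fully loaded

' load() reads the file itself and honors the encoding declaration.
If myXML.load(Server.MapPath("input.xml")) Then
    Set ProductNodes = myXML.documentElement.selectNodes("//Product")
    If ProductNodes.length > 0 Then
        Response.Write Server.HTMLEncode(ProductNodes(0).getAttribute("Name"))
    End If
Else
    ' parseError says why the document could not be loaded.
    Response.Write "Parse error: " & Server.HTMLEncode(myXML.parseError.reason)
End If
```

Whether the umlaut then shows up correctly in the browser also depends on the encoding of the response, which this sketch does not set.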

Tomalak
If I do that, I get an XML Parse error. :(
BlaM
Try the "myXML.load()" variant. If that fails as well, the file is not well formed.
Tomalak
Great, that works. I had the loadXML version because there is another way of submitting the XML (which is by inserting the XML code into a form field). Guess I will just disable that option. Nobody will need that anyway if there is an upload function :)
BlaM
You can also just wrap that in an If .. Then ... Else. :-)
Tomalak
(I already put that in an If...Then...Else, but it still breaks character encoding if I use "loadXML")
BlaM
But the load(...) works?
Tomalak
Forget everything I said. load works, loadXML works. I just believed the bug report I got without trying myself.
BlaM
Confirmed: Everything works as expected now. Thanks for your great help!
BlaM
You are welcome. Never trust a user. ;-)
Tomalak
myXML.load("input.xml") is sort of silly; the parentheses don't belong there. This sort of thing is a bad habit and will burn you elsewhere. The FSO can only handle ASCII and "Unicode" (i.e. UTF-16LE). When you want to read/write alternate text encodings or binary data, consider using ADODB.Stream instead. It is handy and quite versatile. Clearly not required at all in this case though.
Bob Riemersma
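
For reference, a minimal sketch of the ADODB.Stream approach mentioned in the last comment, for cases where the UTF-8 text is needed as a string. The variable names and the Server.MapPath call are assumptions for illustration; nothing like this appears in the original thread:

```vbscript
' Sketch only: read a UTF-8 text file into a VBScript string.
Const adTypeText = 2          ' value of the ADO StreamTypeEnum constant

Dim stream, xmlText
Set stream = Server.CreateObject("ADODB.Stream")
stream.Type = adTypeText      ' text mode rather than binary
stream.Charset = "utf-8"      ' decode the file's bytes as UTF-8
stream.Open
stream.LoadFromFile Server.MapPath("input.xml")   ' assumed location
xmlText = stream.ReadText     ' reads the whole stream by default
stream.Close
Set stream = Nothing
```

As the comment says, this is not needed when MSXML can load the file directly.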