I'm at the receiving end of an HTTP POST (x-www-form-urlencoded), where one of the fields contains an XML document. I need to receive that document, inspect a couple of elements, and store it in a database for later use. The document is UTF-8 encoded (and declares that in its XML declaration), and it can contain all sorts of non-ASCII characters.
When I receive the data, like this:
Set xmlDoc = CreateObject("MSXML2.DOMDocument.3.0")
xmlDoc.async = False
xmlDoc.loadXML(Request.Form("xml"))
everything I can dig out of the DOM document is still in UTF-8 form. For example, this document (grossly simplified):
<?xml version="1.0" encoding="UTF-8"?>
<data>
ä
</data>
always comes out as
<?xml version="1.0" encoding="UTF-8"?>
<data>
ä
</data>
If I look at xmlDoc.XML, I get this:
<?xml version="1.0"?>
<data>
ä
</data>
It drops the encoding attribute from the XML declaration (which sort of makes sense, since VBScript strings are encoding-agnostic), but the content is still a sequence of characters spelling out the bytes of a UTF-8 encoded document.
It's as if MSXML ignored the encoding information in the declaration. Is the problem with MSXML, or with the encoding of the POST data? It looks like a form of "double encoding": first UTF-8 (where non-ASCII characters become multi-byte sequences), then URL encoding byte by byte ("ä" is actually sent as %C3%A4).
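The mismatch is easy to reproduce outside of ASP. This short Python sketch (illustration only; the real code is classic ASP/VBScript) shows how the same percent-encoded bytes come out differently depending on which character set the receiver assumes when decoding the form field:

```python
from urllib.parse import unquote

# "ä" is sent as %C3%A4: its UTF-8 bytes, percent-encoded byte by byte.
encoded = "%C3%A4"

# Decoding those bytes as UTF-8 recovers the original character:
as_utf8 = unquote(encoded, encoding="utf-8")

# Decoding them as Latin-1 (roughly what a Western-codepage ASP page
# does by default) turns each byte into a separate character:
as_latin1 = unquote(encoded, encoding="latin-1")

print(as_utf8)    # -> ä
print(as_latin1)  # -> Ã¤
```

This matches the symptom exactly: the request decoder, not MSXML, has already mangled the string before loadXML ever sees it.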
I would not want to hard-code any assumption that the data is always UTF-8 (it could well be UTF-16 some time in the future). Nor can I do a "hard conversion" to some other character set such as iso-8859-1, because the data can contain Cyrillic and Arabic characters. How should I go about fixing this?
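One encoding-agnostic approach I can think of: undo the URL encoding at the byte level, and hand the raw bytes to the XML parser, which can then honor whatever encoding the XML declaration (or BOM) names. In classic ASP that would presumably mean Request.BinaryRead plus loading the bytes into MSXML via a stream; the Python sketch below just illustrates the principle (the field value is the simplified example document, percent-encoded):

```python
from urllib.parse import unquote_to_bytes
import xml.etree.ElementTree as ET

# The form field value exactly as it arrives on the wire
# (<?xml version="1.0" encoding="UTF-8"?><data>ä</data>, percent-encoded):
raw_field = ("%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22UTF-8%22"
             "%3F%3E%3Cdata%3E%C3%A4%3C%2Fdata%3E")

# Step 1: undo the URL encoding byte by byte -- no character set
# assumption is made at this stage.
xml_bytes = unquote_to_bytes(raw_field)

# Step 2: give the parser raw bytes, so it reads the encoding from the
# XML declaration itself instead of trusting the web framework's guess.
root = ET.fromstring(xml_bytes)
print(root.text)  # -> ä
```

The key point is that the decode-to-characters step is deferred to the XML parser, so the same code would keep working if the sender switched to UTF-16.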