
I'm at the receiving end of an HTTP POST (x-www-form-urlencoded), where one of the fields contains an XML document. I need to receive that document, look at a couple of elements, and store it in a database (for later use). The document is in UTF-8 format (and has the appropriate header), and can contain lots of strange characters.

When I receive the data, like this:

Set xmlDoc = CreateObject("MSXML2.DOMDocument.3.0")
xmlDoc.async = False
xmlDoc.loadXML(Request.Form("xml"))

everything I can dig out of the DOM document is still in UTF-8 form. For example, this document (grossly simplified):

<?xml version="1.0" encoding="UTF-8"?>
<data>
 ä
</data>

always comes out as

<?xml version="1.0" encoding="UTF-8"?>
<data>
 ä
</data>

If I look at xmlDoc.XML, I get this:

<?xml version="1.0"?>
<data>
 ä
</data>

It removes the encoding from the header (since whatever string I'm using in VBScript is "encoding-agnostic", this sort of makes sense), but it's still a sequence of characters representing a UTF-8-encoded document.

It's just as if MSXML didn't care about the encoding info in the header. Is the problem with MSXML, or is it with the encoding of the post data? It's a form of "double encoding": first UTF-8 (where certain characters are written with several bytes) and then URL-encoded byte by byte ("ä" is actually sent as %C3%A4).

I would not want to hard-code anything such as assuming that it is always UTF-8 (as it could well be UTF-16 sometime in the future). I cannot do a "hard conversion" to any other character set either (such as iso-8859-1), as the data can contain cyrillic and arabic characters. How should I go about fixing this?

+3  A: 

Option 1

Before reading any form fields, modify your response code page:

Response.CodePage = 65001

The problem is that the content of the form data is not understood by the receiving page to be UTF-8. Hence %C3%A4 is seen as two distinct ANSI characters. The page's Response.CodePage, oddly, influences how the form data is decoded in the absence of character-set info sent by the client.

Option 2

Modify the form element on the source page by adding the following attribute to it:

<form accept-charset="UTF-8" ...>

This enforces UTF-8 encoding of the characters in the POST, and it also causes the POST to carry data about the chosen charset, which gives the server the info it needs to decode correctly.

Option 3

Finally, my preference: don't post the XML as a field value of a form. Turn it around: add the other form field values as attributes or elements of the XML, then post the XML itself using XmlHttpRequest. For navigation, have the server return a URL to which the client should navigate; the URL would contain a GUID handle to the posted data, so that when the server receives the request it can take the appropriate action. This is all a bit more work; one of the other two options should do fine for you.
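The essential point of option 3 is that the XML travels as the raw request body with an explicit charset, so no form decoding is involved at all. A minimal sketch of such a client (in Python rather than XmlHttpRequest, to keep it self-contained; the endpoint URL is hypothetical):

```python
import urllib.request

xml = '<?xml version="1.0" encoding="UTF-8"?><data>ä</data>'

req = urllib.request.Request(
    "http://example.com/receive.asp",      # hypothetical endpoint
    data=xml.encode("utf-8"),              # raw UTF-8 bytes as the body
    headers={"Content-Type": "text/xml; charset=utf-8"},
    method="POST",
)
# resp = urllib.request.urlopen(req)
# The server reads the body bytes directly and can hand them to the
# XML parser, which honors the declared encoding.
```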

AnthonyWJones
A: 

Option 3 can be pretty much ruled out at the moment due to the added complexity of such a rewrite.

Option 1 just seems strange to me, that the codepage of the response should dictate what happens with the request, but if that's the way it is, then so be it.

As for option 2, it's not really a browser form posting, but a small script client (using cURL). What would be the resulting HTTP header sent from that, which could be added to the scripted request?
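Presumably a scripted client can achieve the same effect by declaring the charset in its Content-Type header, e.g. `Content-Type: application/x-www-form-urlencoded; charset=UTF-8` (with cURL, something like `-H "Content-Type: ..."`; whether the receiving page honors it would need testing). A sketch of the equivalent request built in Python, with a hypothetical endpoint:

```python
import urllib.parse
import urllib.request

xml = '<?xml version="1.0" encoding="UTF-8"?><data>ä</data>'

# Build the urlencoded body: "ä" becomes the percent-escaped UTF-8 bytes %C3%A4.
body = urllib.parse.urlencode({"xml": xml}, encoding="utf-8")

req = urllib.request.Request(
    "http://example.com/receive.asp",  # hypothetical endpoint
    data=body.encode("ascii"),
    headers={
        # Declares the charset of the form data explicitly:
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    },
)
```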

In all, I guess this means that MSXML simply ignores whatever encoding is set in the xml header when loading from a string.

ionn
@ionn: I'm a little confused; are you part of a team with @jstck? For option 2 you might try adding the header "Accept-Charset: UTF-8" to the request headers being sent. However, this is also a bit weird, since it actually states what the required __response__ charset should be. I find option 1 more reliable. I don't know cURL, but in scripting environments option 3 is far preferable; perhaps cURL is different, though.
AnthonyWJones