I assumed that any data being sent in my parameter strings would be UTF-8, since that is what my whole site uses throughout. Lo and behold, I was wrong.

This example page contains the character ä as UTF-8 in the document (from the query string), but sends B\xe4ule (which is either ISO-8859-1 or Windows-1252) when you click submit. It also fires off an AJAX request, which likewise fails when trying to decode the non-UTF-8 character.

And in Django, my request.POST is really screwed up:

>>> print request.POST
<QueryDict: {u'alias': [u'eu.wowarmory.com/character-sheet.xml?r=Der Rat von Dalaran&cn=B\ufffde']}>

How can I just make all these headaches go away and work in utf8?

A: 

Although AFAIK it's not specified anywhere, all browsers use the character encoding of the HTML page in which the form is embedded as the encoding for submitting the form back to the server. So if you want the URL parameters to be UTF-8 encoded, you have to make sure that the HTML page containing the form is also UTF-8 encoded.

jarnbjo
well, the page I gave as an example was served in UTF-8 right?
Paul Tarjan
That depends entirely on the headers you declared in the page. And, possibly, also on the HTTP headers sent by the server behind your back (this one can be tricky).
Arthur Reutenauer
Well, checking my page: `<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />` and headers: `Content-Type text/html; charset=utf-8`
Paul Tarjan
OK, it took me some time to check it because the URL you gave was not really helpful. But you're right, the HTML declaration is correct. What about the HTTP headers then?
Arthur Reutenauer
I'm not able to check the page you are linking to since it requires a login, and using my OpenID causes an error. Sorry, but you could try to make it a little bit easier to help you ...
jarnbjo
Arthur Reutenauer
Actually the page is the metaward.com page trying to import the eu.wowarmory.com page, which is causing the problem. And the OpenID login is actually working, but the crash is being caused by the non-UTF-8 character that is in the query string. Sorry for the pain! Also, the page I linked fires off an AJAX request, which is also failing due to the umlaut.
Paul Tarjan
Woo, that's tough :-) But I really have no idea about your question, sorry.
Arthur Reutenauer
When I try to log in with my OpenID from myopenid.com I still get the following error page: "You broke the award system! Use the feedback button on the right to let me know what dastardly deed you were doing to break it, and I'll give you a shiny award for your efforts."
jarnbjo
What Paul means is that if you suppress the ä in the URL you can actually log in. Which enabled me to check that the page is actually served in UTF-8. So the problem must really be with the way Python encodes the query when submitting it to the form.
Arthur Reutenauer
Ok, I get it now. What I see is that the add function obviously decodes the URL parameter correctly, since it is rendered properly in the generated HTML page (in the form field). The AJAX request, which is generated by the page coming from the add URL encodes the request correctly, but still the handling of the parse URL obviously fails, since the server generates an HTTP 500 response. Paul, is there any obvious reason in your server code why the add URL works and the parse URL fails?
jarnbjo
A: 

According to http://stackoverflow.com/questions/544071/get-non-utf-8-form-fields-as-utf-8-in-php, you'll need to make sure the page itself is served up using UTF8 encoding.

Justin Grant
well, the page I gave as an example was served in UTF-8 right?
Paul Tarjan
A: 

Since Django 1.0 all values you get from form submission are unicode objects, not bytestrings like in Django 0.96 and earlier. To get utf-8 from your values encode them with utf-8 codec:

request.POST['somefield'].encode('utf-8')

To get query parameters decoded properly, they have to be properly encoded first:

In [3]: urllib.quote('ä')
Out[3]: '%C3%A4'

I think your problem comes from bad encoding of query parameters.
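To make the difference concrete, here is a minimal sketch of what happens when the same name is percent-encoded as UTF-8 versus Latin-1 and then decoded on the server side. It assumes Python 3, where `urllib.quote` from the session above lives in `urllib.parse`; the variable names are purely illustrative.

```python
from urllib.parse import quote, unquote

name = "Bäule"

# quote() percent-encodes the UTF-8 bytes by default.
utf8_query = quote(name)
# Roughly what a client that encodes as Latin-1 would send instead.
latin1_query = quote(name, encoding="latin-1")

# unquote() decodes as UTF-8 by default and replaces invalid bytes
# with U+FFFD rather than raising.
decoded_ok = unquote(utf8_query)
decoded_bad = unquote(latin1_query)
# Decoding with the encoding the client really used recovers the name.
decoded_fixed = unquote(latin1_query, encoding="latin-1")

print(utf8_query, latin1_query)
print(repr(decoded_ok), repr(decoded_bad), repr(decoded_fixed))
```

The replacement character in `decoded_bad` is the same symptom as the `\ufffd` in the QueryDict from the question: the bytes arrived in a legacy encoding but were decoded as UTF-8.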

zgoda
A: 

You should also add accept-charset="UTF-8" to the <form> tag.

Cixate
A: 

Getting a UTF-8 string from the submitted form should just be a matter of encoding the unicode object:

next = request.POST['next'].encode('utf-8')

For the AJAX request, can you confirm that that request is also being sent as utf-8 and declared as utf-8 in the headers?
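If it turns out that some client really does send legacy-encoded bytes, one possible workaround is to decode the raw value defensively. The following is only a sketch, assuming Python 3 and that you have access to the raw bytes of the parameter; `decode_param` is a hypothetical helper, not part of Django.

```python
def decode_param(raw, fallbacks=("cp1252", "latin-1")):
    """Decode raw bytes as UTF-8, falling back to legacy encodings.

    Hypothetical helper: tries strict UTF-8 first, then each fallback
    in order. Latin-1 maps every byte, so with the default fallbacks
    the final replace step is never reached.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    for encoding in fallbacks:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached when custom fallbacks all fail.
    return raw.decode("utf-8", errors="replace")


print(decode_param(b"B\xc3\xa4ule"))  # UTF-8 input
print(decode_param(b"B\xe4ule"))      # Latin-1 / Windows-1252 input
```

This is a heuristic: a legacy-encoded string can occasionally also be valid UTF-8, so getting the client to send UTF-8 in the first place remains the more reliable fix.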