If I execute the following Python 3.1 program, I see only � instead of the correct characters in my browser. The file itself is UTF-8 encoded and the same encoding is sent with the response.

from wsgiref.simple_server import make_server

page = "<html><body>äöü€ßÄÖÜ</body></html>"

def application(environ, start_response):
    start_response("200 Ok", [("Content-Type", "text/html; charset=UTF-8")])
    return page

httpd = make_server('', 8000, application)
print("Serving on port 8000...")
httpd.serve_forever()

"UTF-8" is set correctly in the response:

HTTP/1.0 200 Ok
Date: Mon, 09 Aug 2010 16:35:02 GMT
Server: WSGIServer/0.1 Python/3.1.1+
Content-Type: text/html; charset=UTF-8

What is wrong here?

A: 

Those characters are not UTF-8; they are latin-1. If you put those literals into your Python source code (which you shouldn't do), you need to declare the encoding of the file, by placing the following line at the top:

#-*- coding: latin-1 -*-

and serving in latin-1:

start_response("200 Ok", [("Content-Type", "text/html; charset=latin-1")])

Assuming you meant to do everything in UTF-8, you need to look up the code points for those characters. You can then do

page = u"\x--\x--...\x--"

and serve that up as Unicode.

Note that you can verify this by changing the encoding of your browser; if you manually change it to latin-1 the characters will display fine.
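As a rough illustration of that check, here is a sketch of what a decoder does when the announced charset and the actual bytes disagree. It assumes the bytes on the wire were Latin-1 while the header said UTF-8, which the question does not confirm; the variable names are made up for the example.

```python
# Sketch only. Assumption: the body was sent as Latin-1 bytes although
# the Content-Type header announced UTF-8.
text = "äöüßÄÖÜ"  # the euro sign is left out because Latin-1 cannot encode it

latin1_bytes = text.encode("latin-1")

# A browser trusting "charset=UTF-8" decodes roughly like this:
garbled = latin1_bytes.decode("utf-8", errors="replace")
# Every byte is invalid as UTF-8 here, so only U+FFFD characters remain.
assert set(garbled) == {"\ufffd"}

# Manually switching the browser to Latin-1 corresponds to this:
assert latin1_bytes.decode("latin-1") == text

# If the body really is UTF-8, the announced charset round-trips cleanly.
assert text.encode("utf-8").decode("utf-8") == text
```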

katrielalex
I thought that `#-*- coding: ...` is not needed with Python >= 3. The characters shown could in principle be written in UTF-8 directly.
deamon
A: 

WSGI on Python 3 doesn't exist yet. The Web-SIG have still not reached any conclusion about how strings (bytes/unicode) are to be handled in Python 3.x.

wsgiref is largely an automated 2to3 conversion; it still has problems quite apart from the open question of what WSGI on 3.x will actually mean. Don't rely on it as a reference for how WSGI apps will work under Python 3.

That the situation is still like this coming into the 3.2 release cycle is embarrassing and depressing.

return page

Well, whilst WSGI for 3.x is still an unknown factor, one thing most agree on is that the response body of a WSGI app should generally be bytes and not unicode, since HTTP is a bytes-based protocol. Whether Unicode strings will be accepted—and if so what encoding they'll be converted with—remains to be seen, so avoid the issue and return bytes:

return [page.encode('utf-8')]

(The [] are needed because WSGI apps should return an iterable that's output and flushed an item at a time. If you pass a string on its own, that's used as an iterable and returned a character at a time, which is horrible for performance.)
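Putting that together, a minimal sketch of the corrected app might look like the following. The `serve` helper and the direct call at the bottom are only there for illustration and are not part of the question's code; the sketch deliberately does not rely on how wsgiref treats str bodies.

```python
from wsgiref.simple_server import make_server
from wsgiref.util import setup_testing_defaults

page = "<html><body>äöü€ßÄÖÜ</body></html>"

def application(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/html; charset=UTF-8")])
    # Encode explicitly and wrap in a list: the server gets one bytes item.
    return [page.encode("utf-8")]

def serve(port=8000):
    # Hypothetical helper: starts the development server and blocks.
    httpd = make_server("", port, application)
    print("Serving on port %d..." % port)
    httpd.serve_forever()

# Quick check without any network: call the app with a fake environ
# and a stand-in for start_response that records what it was given.
environ = {}
setup_testing_defaults(environ)
seen = []
body = application(environ, lambda status, headers: seen.append((status, headers)))
assert b"".join(body).decode("utf-8") == page
```

Call `serve()` to actually start the server; the check at the bottom only confirms that the body is a list of bytes that decodes back to the original page.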

bobince
Thanks for the enlightenment. But `return page.encode('utf-8')` doesn't work. I get the following error from the WSGI runtime: `AssertionError: write() argument must be a string or bytes`.
deamon
It works with `return [page.encode('utf-8')]`.
deamon
Yeah, sorry, I edited the bit about `[]` in afterwards! The case where `[]` is missing fails harder for byte strings than for unicode because in Python 3, `b'A'[0]` is the integer 65, not `b'A'`. Pretty much the worst mistake Python 3 made, IMO.
bobince
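The indexing behaviour mentioned in the last comment is easy to check in a Python 3 interpreter. A small sketch, with variable names chosen just for illustration:

```python
data = "A€".encode("utf-8")

# Indexing bytes gives an int; slicing gives bytes.
assert data[0] == 65
assert data[0:1] == b"A"

# Iterating over bare bytes therefore yields ints, which a WSGI server
# cannot write to the client.
assert list(data) == [65, 226, 130, 172]

# Wrapping the body in a list yields it as a single bytes item instead.
assert list([data]) == [b"A\xe2\x82\xac"]

# A str, by contrast, iterates as one-character strings.
assert list("A€") == ["A", "€"]
```

This is why the missing `[]` fails with an assertion for bytes, while a bare str merely gets sent one character at a time.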