In principle this should work:
>>> urllib.unquote_plus('http://www.foo.com/%E5%9C%B0%E9%9C%87').decode('utf-8')
u'http://www.foo.com/\u5730\u9707'
However, note that:
unquote_plus
is for application/x-form-www-urlencoded
data such as POSTed forms and query string parameters. In the path part of a URL, +
means a literal plus sign, not space, so you should use plain unquote
here.
You shouldn't generally unquote a whole URL. Characters that have special meaning in a component of the URL will be lost. You should split the URL into parts, get the single pathname component (%E5%9C%B0%E9%9C%87
) that you are interested in, and then unquote it.
(If you want to fully convert a URI to an IRI like http://www.foo.com/地震
things are a bit more complicated. Only the path/query/fragment part of an IRI is UTF-8-%-encoded; the domain name is mapped between Unicode and bytes using the oddball ‘Punycode’ IDN scheme.)
This gets received in my python server side.
What exactly is your server-side? Server, gateway, framework? And how are you getting the url
variable?
You appear to be getting a UnicodeEncodeError
, which is about unexpected non-ASCII characters in the input to the unquote
function, not an decoding problem at all. So I suggest that something has already decoded the path part of your URL to a Unicode string of some sort. Let's see the repr
of that variable!
There are unfortunately a number of serious problems with several web servers that makes using Unicode in the pathname part of a URL very unreliable, not just in Python but generally.
The main problem is that the PATH_INFO
variable is defined (by the CGI specification, and subsequently by WSGI) to be pre-decoded. This is a dreadful mistake partly because of issue (1) above, which means you can't get %2F
in a path part, but more seriously because decoding a %
-sequence introduces a Unicode decode step that is out of the hands of the application. Server environments differ greatly in how non-ASCII %
-escapes in the URL are handled, and it is often impossible to recreate the exact sequence of bytes that the web browser passed in.
IIS is a particular problem in that it will try to parse the URL path as UTF-8 by default, falling back to the wildly-unreliable system default codepage (eg. cp1252 on a Western Windows install) if the path isn't a valid UTF-8 sequence, but without telling you. You are then likely to have fairly severe problems trying to read any non-ASCII characters in PATH_INFO
out of the environment variables map, because Windows envvars are Unicode but are accessed by Python 2 and many others as bytes in the system codepage.
Apache mitigates the problem by providing an extra non-standard environ REQUEST_URI
that holds the original, completely undecoded URL submitted by the browser, which is easy to handle manually. However if you are using URL rewriting or error documents, that unmapped URL may not match what you thought it was going to be.
Some frameworks attempt to fix up these problems, with varying degrees of success. WSGI 1.1 is expected to make a stab at standardising this, but in the meantime the practical position we're left in is that Unicode paths won't work everywhere, and hacks to try to fix it on one server will typically break it on another.
You can always use URL rewriting to convert a Unicode path into a Unicode query parameter. Since the QUERY_STRING
environ variable is not decoded outside of the application, it is much easier to handle predictably.