Platform: App Engine. Framework: webapp / CGI / WSGI

On my client side (JS), I construct a URL by concatenating a URL with a unicode string:

http://www.foo.com/地震

then I call encodeURI to get

http://www.foo.com/%E5%9C%B0%E9%9C%87

and I put this in an HTML form value.

The form gets submitted to PayPal, where I've set the encoding to 'utf-8'.

PayPal then (through IPN) makes a POST request to the said URL.

On my server side, WSGIApplication tries to extract the unicode string using a regular expression I've defined:

(r'/paypal-listener/(.+?)', c.PayPalIPNListener)

I'd try to decode it by calling

query = unquote_plus(query).decode('utf-8')

(or a variation) but I'd get the error

/paypal-listener/%E5%9C%B0%E9%9C%87

... (omitted) ...

'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

(the first line is the request URL)

When I check the length of query, python says it has length 18, which suggests to me that '%E5%9C%B0%E9%9C%87' has not been decoded in any way.
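
For reference, the server side is shaped roughly like this (a simplified sketch; the handler body is a placeholder rather than my exact code):

# Simplified sketch of the App Engine handler (placeholder code)
from urllib import unquote_plus
from google.appengine.ext import webapp

class PayPalIPNListener(webapp.RequestHandler):
    def post(self, query):
        # 'query' is whatever (.+?) captured from the request path
        item = unquote_plus(query).decode('utf-8')  # this is the line that blows up
        # ... verify the IPN and use 'item' here ...

application = webapp.WSGIApplication(
    [(r'/paypal-listener/(.+?)', PayPalIPNListener)], debug=True)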

+1  A: 
David Morrissey
I'm on app engine (see edit).
Gilbert
try `unquote_plus(str(url)).decode('utf-8')` and see if that works - it looks like the values are `unicode` types with values which haven't been properly unquoted/"utf-8" decoded
David Morrissey
`str(url)` is a bad habit. You should do `unquote_plus(url.encode('ascii')).decode('utf-8')` if `url` is a unicode string.
Glyph
@Glyph: OK, I'll remember that from now on. It doesn't take away from the fact that it shouldn't be necessary at all to do `str(url)` or `url.encode('ascii')`, though - I think web arguments normally aren't `unicode` types unless they've been previously percent-decoded by the web framework/server. If e.g. `cp1252` has previously been used to decode it then it'll still fail on non-ascii characters unless `url.encode('ascii', 'replace')` etc. is used (which would lose some characters, hence why I want more info to try to get to the source of the problem :-)
David Morrissey
A: 

Usually there is a function in server-side languages to decode URLs; in Python it is urllib.unquote. You can also use the decodeURIComponent() function of JavaScript in your case.
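
For example, a minimal sketch of the Python side (urllib.unquote is the rough counterpart of JavaScript's decodeURIComponent()):

# Python 2 sketch: undo the percent-escaping, then decode the UTF-8 bytes
import urllib

segment = '%E5%9C%B0%E9%9C%87'                  # byte string as received
text = urllib.unquote(segment).decode('utf-8')  # u'\u5730\u9707'
print text.encode('utf-8')                      # prints 地震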

Sarfraz
A: 

aaaah, the dreaded

'ascii' codec can't encode characters in position... ordinal not in range

error. unavoidable when dealing with languages like Japanese in python...

this is not a url encode/decode issue in this case. your data is most likely already decoded and ready to go.

i would try getting rid of the call to 'decode' and see what happens. if you get garbage but no error it probably means people are sending you data in one of the other lovely japanese specific encodings: eucjp, iso-2022-jp, shift-jis, or perhaps even the elusive iso-2022-jp-ext which is nowadays only rarely spotted in the wild. this latter case seems pretty unlikely though.
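
if you want to check which encoding you're actually receiving, here's a rough diagnostic sketch (the candidate list is just a guess, not production code):

# rough diagnostic: try a few candidate encodings on the raw bytes and
# report the first one that decodes cleanly
def guess_decode(raw):
    for enc in ('utf-8', 'euc-jp', 'iso-2022-jp', 'shift-jis'):
        try:
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            pass
    return None, None

print guess_decode('\xe5\x9c\xb0\xe9\x9c\x87')  # ('utf-8', u'\u5730\u9707')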

edit: i'd also take a look at this for reference: http://stackoverflow.com/questions/447107/whats-the-difference-between-encode-decode-python-2-x

blackkettle
-1. encode/decode errors are perfectly avoidable in Python.
Daniel Roseman
it is encoded in UTF-8 and not a Japanese-specific encoding as `urllib.unquote_plus('%E5%9C%B0%E9%9C%87').decode('utf-8')` gives `地震` which is the correct result
David Morrissey
yeah i realize that encode/decode errors are avoidable... that was not the point i was trying to make.
blackkettle
@david now im a bit confused. your post says, "unquote_plus(url).decode('utf-8')" throws an error but in the comment above you say, "urllib.unquote_plus('%E5%9C%B0%E9%9C%87').decode('utf-8')" works as expected. the only reasonable conclusion then is that the 'url' value is not what you are testing in the comment, right? it seems it must be something else, and id be willing to bet that that something else is not utf-8.
blackkettle
@david also, app engine and python make it a decent bet that you are working with django, and the django docs say that form data and GET params will be returned as unicode data when you access them, so there is no need to decode it yourself again: http://docs.djangoproject.com/en/dev/ref/unicode/ (see the forms section at the very bottom of the page). i believe pylons works the same way in case you are using that framework.
blackkettle
The above *doesn't fail* because it's a `str` type and not a `unicode` type (i.e. `urllib.unquote_plus(u'%E5%9C%B0%E9%9C%87').decode('utf-8')` will fail). I still don't understand why it doesn't decode it properly myself though (which is what it should be doing), maybe the OP needs to be asked whether he/she's using Django, WebOb etc :-)
David Morrissey
@david right - they aren't the same thing, which is what i was trying to point out, but... urllib.unquote_plus(u'%E5%9C%B0%E9%9C%87'.encode("utf8")) will work with the unicode string. you could then run decode on that: urllib.unquote_plus(u'%E5%9C%B0%E9%9C%87'.encode("utf8")).decode("utf8") if you were so inclined. also, sorry, at some point i think i confused you with the OP.
blackkettle
A: 

urllib.unquote() doesn't like a unicode string in this case. Pass it a byte string and decode afterwards to get unicode.

This works:

>>> import urllib
>>> u = u'http://www.foo.com/%E5%9C%B0%E9%9C%87'
>>> print urllib.unquote(u.encode('ascii'))
http://www.foo.com/地震
>>> print urllib.unquote(u.encode('ascii')).decode('utf-8')
http://www.foo.com/地震

This doesn't (see also "urllib.unquote decodes percent-escapes with Latin-1"):

>>> print urllib.unquote(u)
http://www.foo.com/å °é  

Decoding a string that is already unicode doesn't work:

>>> print urllib.unquote(u).decode('utf-8')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File ".../lib/python2.6/encodings/utf_8.py", line
16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 19-24: o
rdinal not in range(128)
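
A small helper along these lines (a sketch; the name is made up) keeps the input as a byte string before unquoting:

import urllib

def unquote_utf8(quoted):
    # A percent-escaped URL is pure ASCII, so re-encoding a unicode
    # input as ASCII is safe; then unquote the bytes and decode UTF-8.
    if isinstance(quoted, unicode):
        quoted = quoted.encode('ascii')
    return urllib.unquote(quoted).decode('utf-8')

print unquote_utf8(u'http://www.foo.com/%E5%9C%B0%E9%9C%87')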
J.F. Sebastian
This is rather confusing. Your `u.encode('utf-8')` is unexplained and peculiar. The process of making a valid URL out of an ORIGINAL unicode url is (1) encode it in UTF-8 (2) percent-escape it. After that it's an 8-bit str object whose characters are a subset of ASCII. It appears that the OP's url has then been decoded (by just about any encoding that doesn't involve EBCDIC) and he has a unicode string. To reverse the process, start with `bad_url.encode('ascii')` (not utf8). If that fails, the hypothesis is incorrect and we'll have to hope that he gets around to showing us `repr(bad_url)`.
John Machin
yeah, it should be `urllib.unquote_plus(str(u)).decode('utf-8')`. `unquote` doesn't decode percent escapes as Latin-1 as far as I know *unless* the input string is a `unicode` and not a `str` type - otherwise, if it's a `str`, it just gives back the escapes as encoded binary data (i.e. in utf-8 or whatever).
David Morrissey
@John Machin: you're right, `u.encode('utf-8')` is unnecessary because a valid URL must be ASCII.
J.F. Sebastian
@David Morrissey: The bug description that I've linked says that (percent escapes in unicode being decoded as Latin-1)
J.F. Sebastian
Fair enough, why the arguments are unicode at all still isn't solved though - if the encoding is set to e.g. `latin-1` in the web framework then it's still going to fail converting to `str` if certain codes outside `ascii` are already in the string - more info needed to solve this I think :-P
David Morrissey
+1  A: 

In principle this should work:

>>> import urllib
>>> urllib.unquote_plus('http://www.foo.com/%E5%9C%B0%E9%9C%87').decode('utf-8')
u'http://www.foo.com/\u5730\u9707'

However, note that:

  1. unquote_plus is for application/x-www-form-urlencoded data such as POSTed forms and query string parameters. In the path part of a URL, + means a literal plus sign, not space, so you should use plain unquote here.

  2. You shouldn't generally unquote a whole URL. Characters that have special meaning in a component of the URL will be lost. You should split the URL into parts, get the single pathname component (%E5%9C%B0%E9%9C%87) that you are interested in, and then unquote it (see the sketch below).
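
A minimal sketch of that, using urlparse from the standard library (variable names are arbitrary):

import urllib, urlparse

url = 'http://www.foo.com/%E5%9C%B0%E9%9C%87'
path = urlparse.urlsplit(url).path             # '/%E5%9C%B0%E9%9C%87'
segment = path.rsplit('/', 1)[-1]              # just the last path component
print urllib.unquote(segment).decode('utf-8')  # prints 地震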

(If you want to fully convert a URI to an IRI like http://www.foo.com/地震 things are a bit more complicated. Only the path/query/fragment part of an IRI is UTF-8-%-encoded; the domain name is mapped between Unicode and bytes using the oddball ‘Punycode’ IDN scheme.)

"This gets received in my python server side."

What exactly is your server-side? Server, gateway, framework? And how are you getting the url variable?

You appear to be getting a UnicodeEncodeError, which is about unexpected non-ASCII characters in the input to the unquote function, not a decoding problem at all. So I suggest that something has already decoded the path part of your URL to a Unicode string of some sort. Let's see the repr of that variable!
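
Something like this in the handler would show it (a sketch assuming the webapp framework; adapt it to wherever the value actually comes from):

import logging
from google.appengine.ext import webapp

class PayPalIPNListener(webapp.RequestHandler):
    def post(self, query):
        # log the reprs so we can see whether the values are str or
        # unicode and what they actually contain
        logging.info('query=%r', query)
        logging.info('PATH_INFO=%r', self.request.environ.get('PATH_INFO'))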

There are unfortunately a number of serious problems with several web servers that make using Unicode in the pathname part of a URL very unreliable, not just in Python but generally.

The main problem is that the PATH_INFO variable is defined (by the CGI specification, and subsequently by WSGI) to be pre-decoded. This is a dreadful mistake partly because of issue (1) above, which means you can't get %2F in a path part, but more seriously because decoding a %-sequence introduces a Unicode decode step that is out of the hands of the application. Server environments differ greatly in how non-ASCII %-escapes in the URL are handled, and it is often impossible to recreate the exact sequence of bytes that the web browser passed in.

IIS is a particular problem in that it will try to parse the URL path as UTF-8 by default, falling back to the wildly-unreliable system default codepage (e.g. cp1252 on a Western Windows install) if the path isn't a valid UTF-8 sequence, but without telling you. You are then likely to have fairly severe problems trying to read any non-ASCII characters in PATH_INFO out of the environment variables map, because Windows envvars are Unicode but are accessed by Python 2 and many others as bytes in the system codepage.

Apache mitigates the problem by providing an extra non-standard environ REQUEST_URI that holds the original, completely undecoded URL submitted by the browser, which is easy to handle manually. However if you are using URL rewriting or error documents, that unmapped URL may not match what you thought it was going to be.
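
Where REQUEST_URI is available, something along these lines recovers the raw path (a sketch; the helper name is made up):

import urllib, urlparse

def utf8_path_segment(environ):
    # REQUEST_URI is non-standard (Apache); bail out if it is missing
    request_uri = environ.get('REQUEST_URI')
    if request_uri is None:
        return None
    path = urlparse.urlsplit(request_uri).path  # still %-escaped bytes
    segment = path.rsplit('/', 1)[-1]
    return urllib.unquote(segment).decode('utf-8')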

Some frameworks attempt to fix up these problems, with varying degrees of success. WSGI 1.1 is expected to make a stab at standardising this, but in the meantime the practical position we're left in is that Unicode paths won't work everywhere, and hacks to try to fix it on one server will typically break it on another.

You can always use URL rewriting to convert a Unicode path into a Unicode query parameter. Since the QUERY_STRING environ variable is not decoded outside of the application, it is much easier to handle predictably.
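
For example, after rewriting /paypal-listener/X to /paypal-listener?item=X, the application can do the decoding itself (a sketch; the parameter name 'item' is made up):

import cgi

def item_from_query(environ):
    # QUERY_STRING reaches the application undecoded, so the app controls
    # both the %-unescaping and the UTF-8 decode
    params = cgi.parse_qs(environ.get('QUERY_STRING', ''))
    values = params.get('item')
    if not values:
        return None
    return values[0].decode('utf-8')  # bytes -> unicode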

bobince