I have a Unicode string like "Tanım" which is somehow encoded as "Tan%u0131m". How can I convert this encoded string back to the original Unicode? Apparently urllib.unquote does not support Unicode.

+2  A: 
import re

def unquote(text):
    def unicode_unquoter(match):
        return unichr(int(match.group(1), 16))
    return re.sub(r'%u([0-9a-fA-F]{4})', unicode_unquoter, text)
MizardX
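For reference, the same approach on Python 3 (where unichr is gone and str is already Unicode) might look like this — a sketch, not part of the original answer:

```python
import re

def unquote(text):
    # Python 3: chr() covers the full Unicode range, replacing unichr()
    def unicode_unquoter(match):
        return chr(int(match.group(1), 16))
    return re.sub(r'%u([0-9a-fA-F]{4})', unicode_unquoter, text)

print(unquote('Tan%u0131m'))  # Tanım
```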
+13  A: 

%uXXXX is a non-standard encoding scheme that was rejected by the W3C, although an implementation continues to live on in JavaScript land.

The more common technique seems to be to UTF-8 encode the string and then % escape the resulting bytes using %XX. This scheme is supported by urllib.unquote:

>>> import urllib
>>> urllib.unquote("%0a")
'\n'

Unfortunately, if you really need to support %uXXXX, you will probably have to roll your own decoder. Otherwise, it is far preferable to simply UTF-8 encode your Unicode string and percent-escape the resulting bytes.

A more complete example:

>>> import urllib
>>> u"Tanım"
u'Tan\u0131m'
>>> url = urllib.quote(u"Tanım".encode('utf8'))
>>> urllib.unquote(url).decode('utf8')
u'Tan\u0131m'
Aaron Maenpaa
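On Python 3 the same round trip uses urllib.parse, which applies UTF-8 by default — a sketch for comparison, not part of the original answer:

```python
from urllib.parse import quote, unquote

# quote() percent-escapes the UTF-8 bytes of the string by default
url = quote("Tanım")
print(url)           # Tan%C4%B1m
print(unquote(url))  # Tanım
```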
(The answer originally had 'urllib2.unquote'; it should be 'urllib.unquote'.)
jamtoday
+2  A: 

This will do it if you absolutely have to have this (I really do agree with the cries of "non-standard"):

from urllib import unquote

def unquote_u(source):
    result = unquote(source)
    if '%u' in result:
        result = result.replace('%u','\\u').decode('unicode_escape')
    return result

print unquote_u('Tan%u0131m')

> Tanım
Ali A
A slightly pathological case, but: unquote_u('Tan%25u0131m') --> u'Tan\u0131m' rather than u'Tan%u0131m' as it should. Just a reminder of why you probably don't want to write a decoder unless you really need it.
Aaron Maenpaa
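The pathological case can be reproduced with a rough Python 3 port of the answer's logic (a regex stands in for Python 2's str.decode('unicode_escape'), and latin-1 is an assumption mirroring Python 2's byte-wise unquote):

```python
import re
from urllib.parse import unquote

def unquote_u(source):
    # same order of operations as the answer above: %XX first, then %uXXXX
    result = unquote(source, encoding='latin-1')
    if '%u' in result:
        result = re.sub(r'%u([0-9a-fA-F]{4})',
                        lambda m: chr(int(m.group(1), 16)), result)
    return result

# '%25' is an escaped literal '%', so the correct answer is 'Tan%u0131m' ...
print(unquote_u('Tan%25u0131m'))  # Tanım  (wrong: double-decoded)
```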
I totally agree. Which is why I really was not keen to offer an actual solution. These things are never so straightforward. The O.P. might have been desperate though, and I think this complements your excellent answer.
Ali A
A: 

There is a bug in the version above: it sometimes breaks when the string contains both ASCII-escaped and Unicode-escaped characters. I think it happens specifically when there are characters from the upper 128-byte range, like '\xab', in addition to the Unicode escapes.

E.g. "%5B%AB%u03E1%BB%5D" causes this error.

I found if you just did the unicode ones first, the problem went away:

from urllib import unquote

def unquote_u(source):
    result = source
    if '%u' in result:
        result = result.replace('%u', '\\u').decode('unicode_escape')
    result = unquote(result)
    return result
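A Python 3 sketch of the same fixed ordering (names and the latin-1 choice are assumptions; latin-1 mirrors Python 2's byte-per-byte treatment of escapes like %AB):

```python
import re
from urllib.parse import unquote

def unquote_u(source):
    # decode the non-standard %uXXXX escapes first...
    result = re.sub(r'%u([0-9a-fA-F]{4})',
                    lambda m: chr(int(m.group(1), 16)), source)
    # ...then the ordinary %XX escapes, interpreted as Latin-1 bytes
    return unquote(result, encoding='latin-1')

print(unquote_u('%5B%AB%u03E1%BB%5D'))  # [«ϡ»]
```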