Hi. I want to use the Google Language Detection API in my app to detect the language of a URL parameter. For example, the user requests the URL

http://myapp.com/q?Это тест

and gets the message "Russian". I do it this way:

def get(self):
    url = "http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&q=" + self.request.query
    try:
        data = json.loads(urllib2.urlopen(url).read())
        self.response.out.write('<html><body>' + data["responseData"]["language"] + '</body></html>')
    except urllib2.HTTPError, e:
        self.response.out.write("HTTP error: %d" % e.code)
    except urllib2.URLError, e:
        self.response.out.write("Network error: %s" % e.reason.args[1])

but I always get "English" as the result, because the URL arrives percent-encoded as

http://myapp.com/q?%DD%F2%EE%20%F2%E5%F1%F2

I've tried urllib.quote and urllib.urlencode with no luck.

How do I have to decode this URL for the Google API?

+1  A: 

Maybe urllib.unquote is what you are looking for:

>>> from urllib import unquote
>>> unquote("%DD%F2%EE%20%F2%E5%F1%F2")
'\xdd\xf2\xee \xf2\xe5\xf1\xf2'

This gives you a byte string in whatever encoding was used in the URL (windows-1251 in your example). If you want to recode it to a different encoding (say, UTF-8), you have to create a unicode object first and then use its encode method:

>>> from urllib import unquote, quote
>>> import json, urllib2, pprint
>>> decoded = unicode(unquote("%DD%F2%EE%20%F2%E5%F1%F2"), "windows-1251")
>>> print decoded
Это тест
>>> recoded = decoded.encode("utf-8")

At this point we have a UTF-8 encoded string, but it is still not suitable for passing to the Google Language Detection API:

>>> recoded
'\xd0\xad\xd1\x82\xd0\xbe \xd1\x82\xd0\xb5\xd1\x81\xd1\x82'

Since you want to include this string in a URL as a query argument, you have to encode it using urllib.quote:

>>> url = "http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&q=%s" % quote(recoded)
>>> data = json.loads(urllib2.urlopen(url).read())
>>> pprint.pprint(data)
{u'responseData': {u'confidence': 0.094033934,
                   u'isReliable': False,
                   u'language': u'ru'},
 u'responseDetails': None,
 u'responseStatus': 200}
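
For readers on Python 3, where unquote and quote live in urllib.parse, the same decode-then-re-encode steps might look roughly like the sketch below. The helper names recode_query and build_detect_url are hypothetical, the source encoding is assumed to be windows-1251 as in the example above, and the endpoint is simply the one from the question (it may no longer be available, so nothing here actually calls it).

```python
from urllib.parse import quote, unquote

# Endpoint copied from the question; assumed, not verified to be reachable.
DETECT_URL = "http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&q=%s"


def recode_query(raw_query, source_encoding="windows-1251"):
    """Turn a percent-encoded query in source_encoding into UTF-8 percent-encoding."""
    # Decode the percent-escapes as bytes in the assumed source encoding.
    text = unquote(raw_query, encoding=source_encoding)
    # quote() encodes str as UTF-8 by default and escapes every non-ASCII byte.
    return quote(text)


def build_detect_url(raw_query, source_encoding="windows-1251"):
    """Build the request URL without sending it (hypothetical helper)."""
    return DETECT_URL % recode_query(raw_query, source_encoding)


if __name__ == "__main__":
    print(unquote("%DD%F2%EE%20%F2%E5%F1%F2", encoding="windows-1251"))
    print(build_detect_url("%DD%F2%EE%20%F2%E5%F1%F2"))
```

If the browser already sends UTF-8, passing source_encoding="utf-8" makes the round trip a no-op, so the only real decision is which encoding the incoming query uses.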
Tamás
Looks good when I try to print it, but when I send it to Google it throws an exception: UnicodeEncodeError: 'ascii' codec can't encode characters in position
Orsol
You have to pass `recoded` on to `urllib.quote` to obtain a representation which can safely be appended to the Google Language API URL. I'm modifying my example to show that.
Tamás
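
The UnicodeEncodeError from the comment above is presumably what happens when the decoded text goes into the URL without being quoted first: the URL must be plain ASCII, and Cyrillic characters are not. A minimal sketch of that difference, written for Python 3 (the thread itself uses Python 2); is_ascii_safe is a hypothetical helper used only for illustration.

```python
from urllib.parse import quote

text = "Это тест"


def is_ascii_safe(s):
    """Return True if s can be encoded as ASCII, i.e. is safe to place in a URL as-is."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    print(is_ascii_safe(text))         # raw Cyrillic text
    print(is_ascii_safe(quote(text)))  # percent-encoded UTF-8
```

Only the quoted form passes the check, which is why the quote step has to come last, right before the string is appended to the request URL.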