ansaurus

Question

How can I deal with accented letters, German letters, and other characters?

Answer 1

A:

Parse the proper charset from the headers returned by urlopen() and pass it as the fromEncoding argument to the BeautifulSoup constructor.
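A minimal sketch of the charset-parsing step, assuming Python 3 and only the standard library (the thread's own code is Python 2). The helper names charset_from_content_type and decode_body are hypothetical, and the UTF-8 fallback is an assumption rather than something the answer specifies. The resulting charset is what would be passed as fromEncoding.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or `default`."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


def decode_body(raw_bytes, content_type):
    """Decode a response body using the charset its header declares."""
    return raw_bytes.decode(charset_from_content_type(content_type))


# With a live response, the header value would come from
# urllib.request.urlopen(url).headers["Content-Type"].
charset = charset_from_content_type("text/html; charset=ISO-8859-1")
text = decode_body(b"Buenos d\xedas", "text/html; charset=ISO-8859-1")
print(charset)
```

Decoding the bytes yourself and handing BeautifulSoup a text string is an alternative to passing the encoding to the constructor; either way the declared charset is used instead of a guess.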

Ignacio Vazquez-Abrams 2010-10-20 19:58:10

Answer 2

+1 A:

Don't parse http://translate.google.com/translate_t; Google provides an AJAX service for this purpose. The translatedText field in the JSON data returned by ajax.googleapis.com is already a Unicode string.

import json
import urllib
import urllib2

# Map of language names to the codes used in the langpair parameter.
LANG = {
    "arabic": "ar", "bulgarian": "bg", "chinese": "zh-CN",
    "croatian": "hr", "czech": "cs", "danish": "da", "dutch": "nl",
    "english": "en", "finnish": "fi", "french": "fr", "german": "de",
    "greek": "el", "hindi": "hi", "italian": "it", "japanese": "ja",
    "korean": "ko", "norwegian": "no", "polish": "pl", "portuguese": "pt",
    "romanian": "ro", "russian": "ru", "spanish": "es", "swedish": "sv",
}

def translate(text, lang1, lang2):
    base_url = 'http://ajax.googleapis.com/ajax/services/language/translate?'
    # Accept either a language name ("German") or a raw code ("de").
    langpair = '%s|%s' % (LANG.get(lang1.lower(), lang1),
                          LANG.get(lang2.lower(), lang2))
    # The query text must be UTF-8 bytes before it is URL-encoded.
    params = urllib.urlencode((('v', 1.0),
                               ('q', text.encode('utf-8')),
                               ('langpair', langpair)))
    content = urllib2.urlopen(base_url + params).read()
    # json.loads takes the response string; its string values are unicode.
    trans_dict = json.loads(content)
    return trans_dict['responseData']['translatedText']

print translate("Good morning to you friend!", "English", "German")
print translate("Good morning to you friend!", "English", "Italian")
print translate("Good morning to you friend!", "English", "Spanish")

yields

Guten Morgen, du Freund!
Buongiorno a te amico!
Buenos días a ti amigo!
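
The claim that translatedText is already a Unicode string can be checked offline. The sketch below assumes Python 3 and uses a hand-written sample payload, not a captured response; only the responseData and translatedText keys come from the code above, and extract_translation is a hypothetical helper name.

```python
import json

# Hand-written sample shaped like the response the code above expects.
# The accented character is sent as a JSON \u escape, so the payload is ASCII.
sample = '{"responseData": {"translatedText": "Buenos d\\u00edas a ti amigo!"}}'


def extract_translation(content):
    """Return the translated text from a JSON response string."""
    return json.loads(content)["responseData"]["translatedText"]


text = extract_translation(sample)
# ascii() keeps the output printable on any console.
print(type(text).__name__, ascii(text))
```

The JSON parser turns the escape into a real character, so any encoding trouble starts only when the string is written to a console or file.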

unutbu 2010-10-20 20:19:33

Hello, thanks for the help! I didn't know they offered JSON; I'll look into this as well. When running your example, I get the following error:

Traceback (most recent call last):
  File "C:\Users\Sergio\Documents\NetBeansProjects\TranslateMyPyAjax\src\translatemypyajax.py", line 35, in <module>
    print translate("Good morning to you friend!", "English", "Spanish")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 8: ordinal not in range(128)

Serg 2010-10-20 20:27:05

It's translating German and Italian correctly, but not Spanish. I'm guessing that's because of the accented i.

Serg 2010-10-20 20:29:19

@Sergio: I think this is an error that affects Windows users when printing Unicode to a Windows console. See if http://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console/2013263#2013263 or http://stackoverflow.com/questions/3789924/python-os-walk-and-japanese-filename-crash/3791196#3791196 helps.
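
The character u'\xed' at position 8 in the traceback is the í in "días", so the failure happens when the string is encoded for output, not during translation. One possible workaround, sketched here in Python 3 with a hypothetical helper named console_safe, is to replace characters the console encoding cannot represent rather than let the write fail. The fallback to ASCII when no encoding is reported is an assumption.

```python
import sys


def console_safe(text, encoding=None):
    """Return `text` with characters the target encoding lacks replaced by '?'."""
    enc = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # Round-trip through the target encoding, substituting unencodable characters.
    return text.encode(enc, errors="replace").decode(enc)


# Simulate an ASCII-only console: the accented letter is replaced, not fatal.
print(console_safe("Buenos d\xedas a ti amigo!", "ascii"))
```

This trades fidelity for robustness: the output is readable everywhere, but the accent is lost on consoles that cannot display it.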

unutbu 2010-10-20 20:31:06
