>>> teststring = 'aõ'
>>> type(teststring)
<type 'str'>
>>> teststring
'a\xf5'
>>> print teststring
aõ
>>> teststring.decode("ascii", "ignore")
u'a'
>>> teststring.decode("ascii", "ignore").encode("ascii")
'a'

which is what I really wanted it to store internally, since I am removing non-ASCII characters. Why did the decode("ascii") give out a unicode string?

>>> teststringUni = u'aõ'
>>> type(teststringUni)
<type 'unicode'>
>>> print teststringUni
aõ
>>> teststringUni.decode("ascii" , "ignore")

Traceback (most recent call last):
  File "<pyshell#79>", line 1, in <module>
    teststringUni.decode("ascii" , "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf5' in position 1: ordinal not in range(128)
>>> teststringUni.decode("utf-8" , "ignore")

Traceback (most recent call last):
  File "<pyshell#81>", line 1, in <module>
    teststringUni.decode("utf-8" , "ignore")
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf5' in position 1: ordinal not in range(128)
>>> teststringUni.encode("ascii" , "ignore")
'a'

Which is again what I wanted. I don't understand this behavior. Can someone explain to me what is happening here?

Edit: I thought this would help me understand things so I could solve the real problem in my program, which I describe here: http://stackoverflow.com/questions/3669436/converting-unicode-objects-with-non-ascii-symbols-in-them-into-strings-objects-in

+3  A: 

Why did the decode("ascii") give out a unicode string?

Because that's what decode is for: it decodes byte strings like your ASCII one into unicode.

In your second example, you're trying to "decode" a string which is already unicode, which has no effect. To print it to your terminal, though, Python must encode it as your default encoding, which is ASCII - but because you haven't done that step explicitly and therefore haven't specified the 'ignore' parameter, it raises the error that it can't encode the non-ASCII characters.

The trick to all of this is remembering that decode takes an encoded bytestring and converts it to Unicode, and encode does the reverse. It might be easier if you understand that Unicode is not an encoding.
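A minimal sketch of both directions, assuming the same Python 2.7 shell as in the question (the names `raw`, `as_unicode` and `ascii_only` are just illustrative):

>>> raw = 'a\xf5'                        # a byte string (type str), latin-1 encoded
>>> as_unicode = raw.decode("latin-1")   # decode: bytes -> unicode
>>> type(as_unicode)
<type 'unicode'>
>>> ascii_only = as_unicode.encode("ascii", "ignore")  # encode: unicode -> bytes, dropping non-ASCII
>>> ascii_only
'a'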

Daniel Roseman
Well, you are right, except for some details. Since he can print `'a\xf5'` correctly, his terminal's encoding is not ASCII but something else. The console encoding is a really common problem, but it is not the issue this time. Also, `teststringUni.decode("ascii", "ignore")` does not fail when you try to print the result: calling `.decode` on something that is already unicode makes Python first encode it to bytes with the default ASCII codec so that it can then decode them, and it is that implicit encode step that cannot work here.
THC4k
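In Python 2, calling `.decode` on a unicode object first encodes it with the default codec (normally ASCII) and only then decodes the result, which is why the tracebacks above show a UnicodeEncodeError. A sketch of the equivalent explicit steps, assuming the same Python 2.7 shell:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> teststringUni = u'a\xf5'
>>> # teststringUni.decode("utf-8") behaves roughly like:
>>> teststringUni.encode(sys.getdefaultencoding()).decode("utf-8")
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf5' in position 1: ordinal not in range(128)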
Yes, I think that is the problem: what is my terminal encoding? Just because an object's type is str, it does not mean the encoding is ASCII; I understood that. My problem now is to figure out how I can translate something that has type unicode into the string type of the terminal, while retaining all information.
Fullmooninu
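One way to do that (a sketch, not from the thread: it encodes with whatever the terminal reports via `sys.stdout.encoding`, falls back to UTF-8 when that is not available, and the variable names are just illustrative):

>>> import sys
>>> teststringUni = u'a\xf5'
>>> terminal_encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'  # None/absent when redirected
>>> as_bytes = teststringUni.encode(terminal_encoding)   # unicode -> str in the terminal's encoding
>>> type(as_bytes)
<type 'str'>
>>> print as_bytes
aõ

This keeps every character the terminal encoding can represent; to avoid an error on characters it cannot, pass "replace" as the second argument to encode.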
+2  A: 

It's simple: .encode converts Unicode objects into strings, and .decode converts strings into Unicode.
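A minimal illustration of the two directions, assuming the same Python 2.7 shell as above:

>>> u'a\xf5'.encode('utf-8')     # encode: unicode -> byte string
'a\xc3\xb5'
>>> 'a\xc3\xb5'.decode('utf-8')  # decode: byte string -> unicode
u'a\xf5'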

Ned Batchelder
This perspective actually solved it =), thank you.
Fullmooninu