Can someone explain to me this odd thing:

When I type the following Cyrillic string in the Python shell:

>>> print 'абвгд'
абвгд

but when I type:

>>> print u'абвгд'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)

Since the first string came out correctly, I reckon my OS X terminal can represent unicode, but it turns out it can't in the second case. Why?

A: 

A unicode object needs to be encoded before it can be displayed on some consoles. Try

u'абвгд'.encode()

instead, to encode the unicode object into a str object (most likely using utf8 as the default encoding, but that depends on your Python config).

workmad3
this is not working - encode() throws the same error.
Discodancer
+2  A: 

Also, make sure the terminal encoding is set to Unicode/UTF-8 (and not ascii, which seems to be your setting):

http://www.rift.dk/news.php?item.7.6

cdonner
I figured that one, but what bugs me is that my terminal DOES show unicode properly if it's typed as a normal string - e.g. 'уникоде', but throws an error if I try to print the same string as u'уникоде'
Discodancer
+5  A: 

In addition to ensuring your OS X terminal is set to UTF-8, you may wish to set Python's sys default encoding to UTF-8. Create a file called sitecustomize.py in /Library/Python/2.5/site-packages. In this file put:

import sys
sys.setdefaultencoding('utf-8')

The setdefaultencoding method is available only to the site module, and is removed from the sys namespace once startup has completed. As such, you'll need to start a new Python interpreter for the change to take effect. You can verify the current default encoding at any time after startup with sys.getdefaultencoding().

If the characters aren't already unicode and you need to convert them, use the decode method on a string to decode the text from some other charset into unicode. It is best to specify which charset:

s = 'абвгд'.decode('some_cyrillic_charset') # makes the string unicode
print s.encode('utf-8') # transform the unicode into utf-8, then print it
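
As a concrete sketch of the two steps above, the snippet below uses ISO-8859-5 as a stand-in for 'some_cyrillic_charset'. That choice is only an assumption for illustration, so substitute whatever charset your data really uses. The text is written with \u escapes so the example does not depend on the encoding of the source file or the terminal.

```python
# Sketch only: ISO-8859-5 is an assumed example charset, not necessarily yours.
text = u'\u0430\u0431\u0432\u0433\u0434'   # the five Cyrillic letters from the question

raw = text.encode('iso-8859-5')            # bytes as a legacy source might supply them
decoded = raw.decode('iso-8859-5')         # bytes -> unicode, charset named explicitly
utf8_bytes = decoded.encode('utf-8')       # unicode -> UTF-8 bytes, ready for output

print(repr(raw))
print(repr(utf8_bytes))
```

The same five letters take five bytes in ISO-8859-5 and ten bytes in UTF-8, which is why the charset has to be named rather than guessed.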
Jarret Hardie
This solved my problems, although the repr() explanation is not correct. I made a mistake in my question (sorry), which I have now fixed: I WAS actually printing the u'абвгд' string, so it's not a repr() error. In fact, I do not get the error if I omit the print statement; I just get u'\xd0\xb0\xd0\xb1\xd0\xb2\xd0\xb3\xd0\xb4'. My guess would be that the default encoding, mac-roman, is somehow able to represent Cyrillic chars (which, on the other hand, doesn't make sense ...), but not Cyrillic in unicode. I really don't get this :)
Discodancer
Thanks for the info discodancer... you are right... my bad.
Jarret Hardie
+9  A: 
>>> print 'абвгд'
абвгд

When you type in some characters, your terminal decides how these characters are represented to the application. Your terminal might give the characters to the application encoded as utf-8, ISO-8859-5 or even something that only your terminal understands. Python gets these characters as some sequence of bytes. Then python prints out these bytes as they are, and your terminal interprets them in some way to display characters. Since your terminal usually interprets the bytes the same way as it encoded them before, everything is displayed like you typed it in.
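
A small sketch of that round trip, assuming the terminal uses UTF-8 (an assumption; yours may differ). The byte string stands in for what such a terminal would hand to Python when the five letters are typed. Python 2 stores those bytes untouched in a str and writes them back unchanged.

```python
# Sketch only: assumes a UTF-8 terminal. The bytes below are the UTF-8
# encoding of the five Cyrillic letters from the question.
typed = b'\xd0\xb0\xd0\xb1\xd0\xb2\xd0\xb3\xd0\xb4'

byte_count = len(typed)                   # what Python holds: raw bytes
char_count = len(typed.decode('utf-8'))   # what the user sees: characters

print("%d bytes, %d characters" % (byte_count, char_count))
```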

>>> u'абвгд'

Here you type in some characters that arrive at the Python interpreter as a sequence of bytes, maybe encoded in some way by the terminal. With the u prefix Python tries to convert this data to unicode. To do this correctly Python has to know what encoding your terminal uses. In your case it looks like Python guesses your terminal's encoding would be ASCII, but the received data doesn't match that, so you get an encoding error.

The straightforward way to create unicode strings in an interactive session would therefore be something like this:

>>> us = 'абвгд'.decode('my-terminal-encoding')
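
The codec name has to match what the terminal really sends. As an illustration (assuming a UTF-8 terminal, which is only a guess about the asker's setup), decoding the same bytes with three different codecs gives three different outcomes. The Latin-1 result has ten code points rather than five, which would be consistent with the "position 0-9" in the traceback from the question.

```python
# Sketch only: assumes the terminal delivered UTF-8 bytes.
data = b'\xd0\xb0\xd0\xb1\xd0\xb2\xd0\xb3\xd0\xb4'

right = data.decode('utf-8')      # 5 code points: the intended letters
wrong = data.decode('latin-1')    # 10 code points: one per byte, no error raised

try:
    data.decode('ascii')          # bytes >= 0x80 are not valid ASCII
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print("%d %d %s" % (len(right), len(wrong), ascii_ok))
```

A wrong codec does not always raise an error; Latin-1 accepts every byte and silently produces the wrong text.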

In files you can also specify the encoding of the file with a special mode line:

# -*- encoding: ISO-8859-5 -*-
us = u'абвгд'

For other ways to set the default input encoding you can look at sys.setdefaultencoding(...) or sys.stdin.encoding.
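
To check what Python believes about its surroundings, you can inspect these values directly. This is a minimal sketch; the helper name report_encodings is made up for this example. Note that in Python 2 the stream encodings can be None when input or output is not a terminal (for example, when piped).

```python
import sys

def report_encodings():
    """Collect the encodings Python has picked up for this session."""
    return {
        'default': sys.getdefaultencoding(),
        # getattr guards against replaced streams that lack the attribute
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
    }

for name, value in sorted(report_encodings().items()):
    print("%s: %s" % (name, value))
```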

sth
Yeah, this makes a lot of sense to me, thanks.
Discodancer
+1 very comprehensive answer
Jarret Hardie
A: 

'абвгд' is not a unicode string

u'абвгд' is a unicode string

You cannot print unicode strings without encoding them. When you are dealing with strings in your application, you want to make sure that any input is decoded and any output is encoded. This way your application will deal only with unicode strings internally and output strings in UTF-8.

For reference:

>>> 'абвгд'.decode('utf8') == u'абвгд'
True
hekevintran