I was looking at this question and started wondering what print actually does.

I have never found out how to use string.decode() and string.encode() to get a unicode string "out" in the Python interactive shell in the same format as print does. No matter what I do, I get either

  1. UnicodeEncodeError or
  2. the escaped string with "\x##" notation...

This is python 2.x, but I'm already trying to mend my ways and actually call print() :)

Example:

>>> import sys
>>> a = '\xAA\xBB\xCC'
>>> print(a)
ª»Ì
>>> a.encode(sys.stdout.encoding)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xaa in position 0: ordinal not in range(128)
>>> a.decode(sys.stdout.encoding)
u'\xaa\xbb\xcc'

EDIT:

Why am I asking this? I am sick and tired of encode() errors and realized that print can do it (at least in the interactive shell), so I know that there MUST BE A WAY to magically do the encoding PROPERLY, by digging up the info on what encoding to use from somewhere...

ADDITIONAL INFO: I'm running Python 2.4.3 (#1, Sep 3 2009, 15:37:12) [GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2

>>> sys.stdin.encoding
'ISO-8859-1'
>>> sys.stdout.encoding
'ISO-8859-1'

However, the results are the same with Python 2.6.2 (r262:71600, Sep 8 2009, 13:06:43) on the same linux box.

+3  A: 

EDIT: (Major changes between this edit and the previous one... Note: I'm using Python 2.6.4 on an Ubuntu box.)

Firstly, in my first attempt at an answer, I provided some general information on print and str which I'm going to leave below for the benefit of anybody having simpler issues with print and chancing upon this question. As for a new attempt at dealing with the issue experienced by the OP... Basically, I'm inclined to say that there's no silver bullet here and if print somehow manages to make sense of a weird string literal, then that's not reproducible behaviour. I'm led to this conclusion by the following funny interaction with Python in my terminal window:

>>> print '\xaa\xbb\xcc'
��

Have you tried to input ª»Ì directly from the terminal? At a Linux terminal using utf-8 as the encoding, this is actually read in as six bytes, which can then be made to look like three unicode chars with the help of the decode method:

>>> 'ª»Ì'
'\xc2\xaa\xc2\xbb\xc3\x8c'
>>> 'ª»Ì'.decode(sys.stdin.encoding)
u'\xaa\xbb\xcc'

So, the '\xaa\xbb\xcc' literal only makes sense if you decode it as a latin-1 literal (well, actually you could use a different encoding which agrees with latin-1 on the relevant characters). As for print 'just working' in your case, it certainly doesn't for me -- as mentioned above.
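To make the six-bytes-versus-three-bytes point concrete, here is a minimal sketch. It starts from the three code points and encodes them both ways; the variable names are only illustrative, and the u'' and b'' prefixes assume Python 2.6+ or 3.3+ (they are not all available on 2.4).

```python
# The three characters from the question, written as code points
# so the snippet does not depend on the source file encoding.
text = u'\xaa\xbb\xcc'

utf8_bytes = text.encode('utf-8')      # what a UTF-8 terminal sends
latin1_bytes = text.encode('latin-1')  # what the '\xaa\xbb\xcc' literal holds

print(len(utf8_bytes))    # 6
print(len(latin1_bytes))  # 3

# Decoding each byte string with its own encoding gives the same text back.
print(utf8_bytes.decode('utf-8') == latin1_bytes.decode('latin-1'))  # True
```

The byte strings differ, but they stand for the same characters once the right encoding is named.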

This is explained by the fact that a string literal not prefixed with a u (i.e. "asdf" rather than u"asdf") does not produce a unicode string. As a matter of fact, the resulting string object is encoding-unaware: it is just a sequence of bytes, and you have to treat it as if it were encoded with encoding x, for the correct value of x. This basic idea leads me to the following:

a = '\xAA\xBB\xCC'
a.decode('latin1')
# result: u'\xAA\xBB\xCC'
print(a.decode('latin1'))
# output: ª»Ì

Note the lack of decoding errors and the proper output (which I expect to stay proper on any other box). Apparently your string literal can be made sense of by Python, but not without some help.

Does this help? (At least in understanding how things work, if not in making the handling of encodings any easier...)


Now for some funny bits with some explanatory value (hopefully)! This works fine for me:

sys.stdout.write("\xAA\xBB\xCC".decode('latin1').encode(sys.stdout.encoding))

Skipping either the decode or the encode part results in a unicode-related exception. Theoretically speaking, this makes sense, as the first decode is needed to decide what characters there are in the given string (the only thing obvious at first sight is what bytes there are -- the Python 3 idea of having (unicode) strings for characters and bytes for, well, bytes, suddenly seems superbly reasonable), while the encode is needed so that the output respects the encoding of the output stream. Now this

sys.stdout.write("ąöî\n".decode(sys.stdin.encoding).encode(sys.stdout.encoding))

also works as expected, but the characters are actually coming from the keyboard and so are actually encoded with the stdin encoding... Also,

ord('ą'.decode('utf-8').encode('latin2'))

returns the correct 177 (my input encoding is utf-8), but '\xc4\x85'.encode('latin2') makes no sense to Python, as it has no clue as to how to make sense of '\xc4\x85' and figures that trying the 'ascii' codec is the best it can do.
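The decode-then-encode round trip, and the failure when the decode is skipped, can be sketched as follows. In Python 2, calling .encode() on a byte string first decodes it with the default 'ascii' codec; the sketch spells that implicit step out explicitly so it behaves the same on 2.6+ and 3.x. The variable names are only illustrative.

```python
raw = b'\xc4\x85'  # the UTF-8 bytes of the character U+0105

# Step 1: decode, i.e. say which characters the bytes stand for.
char = raw.decode('utf-8')

# Step 2: encode, i.e. pick the bytes for the target encoding.
latin2 = char.encode('latin2')
print(ord(latin2))  # 177

# What Python 2 attempts implicitly when .encode() is called on raw bytes:
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    print('implicit ascii decode fails: %s' % exc)
```

This is also why the question's a.encode(...) call raises a UnicodeDecodeError rather than a UnicodeEncodeError: the failure happens in the hidden decode step.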


The original answer:

The relevant bit of Python docs (for version 2.6.4) says that print(obj) is meant to print out the string given by str(obj). I suppose you could then wrap it in a call to unicode (as in unicode(str(obj))) to get a unicode string out -- or you could just use Python 3 and exchange this particular nuisance for a couple of different ones. ;-)

Incidentally, this shows that you can manipulate the result of printing an object just like you can manipulate the result of calling str on an object, that is by messing with the __str__ method. Example:

class Foo(object):
    def __str__(self):
        return "I'm a Foo!"

print Foo()

As for the actual implementation of print, I expect this won't be useful at all, but if you really want to know what's going on... It's in the file Python/bltinmodule.c in the Python sources (I'm looking at version 2.6.4). Search for a line beginning with builtin_print. It's actually entirely straightforward, no magic going on there. :-)

Hopefully this answers your question... But if you do have a more arcane problem which I'm missing entirely, do comment, I'll make a second attempt. Also, I'm assuming we're dealing with Python 2.x; otherwise I guess I wouldn't have a useful comment.

Michał Marczyk
Unfortunately the builtin_print is not in that file in 2.4 http://svn.python.org/view/python/branches/release24-maint/Python/bltinmodule.c?view=markup
Kimvais
I guess that's because back then, `print` was still syntax, whereas `builtin_print` is needed to get it to work as a function. Also, when decoding strings coming from stdin, you'll want to use `sys.stdin.encoding` rather than `sys.stdout.encoding` -- though on today's typical box in all probability they're the same.
Michał Marczyk
Um, I guess I'm only hoping to clarify the goings-on under the hood with the last amendment to the answer -- as for what can be done to avoid encoding issues, I guess it's not very optimistic. Anyway, I wonder if it does clarify anything... And then there's my new comment attached to the question itself. I'm definitely beginning to share in the "academic interest" involved here. (I'm adding this to interesting tags, BTW. ;-))
Michał Marczyk
+2  A: 

When given a unicode object, print() uses sys.stdout.encoding to determine what the output console can understand and then uses this encoding to encode the object. A plain byte string (str) is written as it is, without any encoding step.

[EDIT] If you look at the source, it gets sys.stdout and then calls:

PyFile_WriteObject(PyTuple_GetItem(args, i), file,
                   Py_PRINT_RAW);

I guess the magic is in Py_PRINT_RAW but the source just says:

    if (flags & Py_PRINT_RAW) {
        value = PyObject_Str(v);
    }

So no magic here. A loop over the arguments with sys.stdout.write(str(item)) should do the trick.
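As a rough sketch of that loop (not the real implementation of print), something like the following could be used. The function name emulate_print and its stream parameter are made up for illustration, and the separator and trailing newline are assumed to match print's defaults.

```python
import sys


def emulate_print(items, stream=None):
    """Write str(item) for each item, roughly the way print does.

    Hypothetical helper for illustration only.
    """
    if stream is None:
        stream = sys.stdout
    for index, item in enumerate(items):
        if index > 0:
            stream.write(' ')    # print separates arguments with a space
        stream.write(str(item))  # the only conversion applied is str()
    stream.write('\n')


emulate_print(['spam', 42, [1, 2]])  # writes: spam 42 [1, 2]
```

Nothing in the loop looks at an encoding, which matches what the C source shows for plain byte strings.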

Aaron Digulla
+1 for clearing up the important subtlety I entirely missed in my answer.
Michał Marczyk
While this is probably correct, it does not seem to answer my question. Apparently, print() eventually calls sys.stdout.write() that does some magic because the str.encode(sys.stdout.encoding) fails...
Kimvais
@Kimvais: I looked up the source. No magic.
Aaron Digulla
+1  A: 
>>> import sys
>>> a = '\xAA\xBB\xCC'
>>> print(a)
ª»Ì

All print is doing here is writing raw bytes to sys.stdout. The string a is a string of bytes, not Unicode characters.

Why am I asking this? I am sick and tired of encode() errors and realized that print can do it (at least in the interactive shell), so I know that there MUST BE A WAY to magically do the encoding PROPERLY, by digging up the info on what encoding to use from somewhere...

Alas no, print is doing nothing at all magical here. You hand it some bytes, it dumps the bytes to stdout.

To use .encode() and .decode() properly, you need to understand the difference between bytes and characters, and I'm afraid you do have to figure out the correct encoding to use.
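A small sketch of why the encoding cannot be guessed from the bytes alone: the same three bytes decode to different characters under different encodings, and are not valid at all under others. The choice of cp1251 as the second encoding is arbitrary, and the b'' prefix assumes Python 2.6+ or 3.x.

```python
raw = b'\xaa\xbb\xcc'  # the bytes from the question

as_latin1 = raw.decode('latin-1')  # one possible reading
as_cp1251 = raw.decode('cp1251')   # another, equally "valid" reading

print(as_latin1 == as_cp1251)  # False

# Some encodings reject the bytes outright.
try:
    raw.decode('utf-8')
except UnicodeDecodeError:
    print('not valid UTF-8')
```

Only knowledge of where the bytes came from tells you which reading is the intended one.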

Jason Orendorff
A: 
import sys

source_file_encoding = 'latin-1' # if there is no -*- coding: ... -*- line

a = '\xaa\xbb\xcc' # raw bytes that represent string in source_file_encoding

# print bytes, my terminal tries to interpret it as 'utf-8'
sys.stdout.write(a+'\n') 
# -> ��

ua = a.decode(source_file_encoding)
sys.stdout.write(ua.encode(sys.stdout.encoding)+'\n')
# -> ª»Ì

See PEP 263, Defining Python Source Code Encodings.
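For comparison, a minimal sketch of the declared-encoding route: with a coding line at the top of the file and a u'' literal, the decode step is done by Python when it reads the source. This assumes the file really is saved as UTF-8, and the fallback to 'utf-8' when the stream reports no encoding is my own assumption, not something Python does for you.

```python
# -*- coding: utf-8 -*-
# Assumes this file is saved as UTF-8, matching the declaration above.
import sys

# A unicode literal: decoded from the source bytes using the declared encoding.
text = u'ª»Ì'

# sys.stdout.encoding can be missing or None (e.g. when output is piped),
# so fall back to an assumed default.
encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'

# 'replace' avoids an exception if the target encoding lacks a character.
data = text.encode(encoding, 'replace')

print(len(text))  # 3
```

Here data holds the bytes that would be appropriate for the output stream, which is the same result the explicit decode/encode pair above arrives at.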

J.F. Sebastian