Hi, I'm struggling with print and unicode conversion. Here is some code executed in the Python 2.5 interpreter on Windows.

>>> import sys
>>> print sys.stdout.encoding
cp850
>>> print u"é"
é
>>> print u"é".encode("cp850")
é
>>> print u"é".encode("utf8")
├®
>>> print u"é".__repr__()
u'\xe9'

>>> class A():
...    def __unicode__(self):
...       return u"é"
...
>>> print A()
<__main__.A instance at 0x0000000002AEEA88>

>>> class B():
...    def __repr__(self):
...       return u"é".encode("cp850")
...
>>> print B()
é

>>> class C():
...    def __repr__(self):
...       return u"é".encode("utf8")
...
>>> print C()
├®

>>> class D():
...    def __str__(self):
...       return u"é"
...
>>> print D()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

>>> class E():
...    def __repr__(self):
...       return u"é"
...
>>> print E()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

So, when a unicode string is printed, it is not its __repr__() method that is called and printed.
But when an object is printed, __str__() or __repr__() (if __str__ is not implemented) is called, not __unicode__(). Neither can return a unicode string.
But why? If __repr__() or __str__() returns a unicode string, why shouldn't the behavior be the same as when we print a unicode string? In other words: why is print D() different from print D().__str__()?

Am I missing something?

These samples also show that if you want to print an object represented with unicode strings, you have to encode it to a byte string (type str). But for nice printing (avoiding the "├®"), the result depends on the sys.stdout encoding.
So, do I have to add u"é".encode(sys.stdout.encoding) to each of my __str__ or __repr__ methods? Or return repr(u"é")? What if I use piping? Is it the same encoding as sys.stdout?

My main issue is to make a class "printable", i.e. print A() prints something fully readable (not with the \x*** unicode characters). Here is the bad behavior/code that needs to be modified:

class User(object):
    name = u"Luiz Inácio Lula da Silva"
    def __repr__(self):
        # Attempt 1: returns unicode, so print raises UnicodeEncodeError
        return "<User: %s>" % self.name
        # Attempt 2: won't display gracefully
        # e.g. print repr(u'é') -> u'\xe9'
        return repr("<User: %s>" % self.name)
        # Attempt 3: won't display gracefully on a cp850 console
        # e.g. print u"é".encode("utf8") -> print '\xc3\xa9' -> ├®
        return ("<User: %s>" % self.name).encode("utf8")

Thanks!

A: 

I presume your sys.getdefaultencoding() is still 'ascii', and I think this is what is used whenever str() or repr() is applied to an object. You could change that with sys.setdefaultencoding(), although that function is only available while site/sitecustomize runs at startup. As soon as you write to a stream, though, be it STDOUT or a file, you have to comply with its encoding. This would also apply to piping in the shell, IMO. I assume that 'print' honors the STDOUT encoding, but the exception happens earlier, when the object is converted to a byte string, before anything is written.

ThomasH
+2  A: 

Python doesn't have many semantic type constraints on given functions and methods, but it has a few, and here's one of them: __str__ (in Python 2.*) must return a byte string. As usual, if a unicode object is found where a byte string is required, the current default encoding (usually 'ascii') is applied in the attempt to make the required byte string from the unicode object in question.
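
The implicit conversion described above can be reproduced by hand. This is only a minimal sketch: it assumes the default encoding is 'ascii' (the stock Python 2 value), and the helper name coerce_to_bytes is hypothetical, standing in for what the interpreter does internally when __str__ returns a unicode object. It is written so that it also runs on Python 3.

```python
DEFAULT_ENCODING = "ascii"  # assumption: stock sys.getdefaultencoding() in Python 2

def coerce_to_bytes(text, encoding=DEFAULT_ENCODING):
    """Hypothetical helper: mimic the implicit unicode -> byte string step."""
    return text.encode(encoding)

# Pure-ASCII text converts without trouble.
ok = coerce_to_bytes(u"Pele")

# Non-ASCII text fails the same way `print D()` does in the question.
try:
    coerce_to_bytes(u"\xe9")
    failed = False
    reason = None
except UnicodeEncodeError as exc:
    failed = True
    reason = exc.reason  # the codec's explanation of the failure
```

Note that sys.stdout never appears here: the failure happens in the conversion itself, before any stream is involved.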

For this operation, the encoding (if any) of any given file object is irrelevant, because what's being returned from __str__ may be about to be printed, or may be going to be subject to completely different and unrelated treatment. Your purpose in calling __str__ does not matter to the call itself and its results; Python, in general, doesn't take into account the "future context" of an operation (what you are going to do with the result after the operation is done) in determining the operation's semantics.

That's because Python doesn't always know your future intentions, and it tries to minimize the amount of surprise. print str(x) and s = str(x); print s (the same operations performed in one gulp vs two), in particular, must have the same effects; in the second case, there will be an exception if str(x) cannot validly produce a byte string (that is, for example, x.__str__() can't), and therefore the exception should also occur in the first case.

print itself (since 2.4, I believe), when presented with a unicode object, takes into consideration the .encoding attribute (if any) of the target stream (by default sys.stdout); other operations, as yet unconnected to any given target stream, don't -- and str(x) (i.e. x.__str__()) is just such an operation.
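
The role of the target stream's encoding can be illustrated without a real console. The sketch below assumes a cp850 console, as in the question, and simulates it by decoding the bytes as cp850; the variable names are illustrative only. It runs on both Python 2 and Python 3.

```python
e_acute = u"\xe9"  # the character é

# What print sends when it encodes with the stream's encoding (cp850).
cp850_bytes = e_acute.encode("cp850")
# What a __str__/__repr__ hard-coded to UTF-8 sends instead.
utf8_bytes = e_acute.encode("utf8")

# A cp850 console interprets every incoming byte as cp850.
shown_cp850 = cp850_bytes.decode("cp850")  # round-trips back to é
shown_utf8 = utf8_bytes.decode("cp850")    # two bytes become two wrong characters
```

Here shown_utf8 is u"\u251c\xae", that is "├®", the mojibake seen in the question for classes that return UTF-8 bytes.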

Hope this helped show the reason for the behavior that is annoying you...

Edit: the OP now clarifies "My main issue is to make a class "printable", i.e. print A() prints something fully readable (not with the \x*** unicode characters).". Here's the approach I think works best for that specific goal:

import sys

DEFAULT_ENCODING = 'UTF-8'  # or whatever you like best

class sic(object):

    def __unicode__(self):  # the "real thing"
        return u'Pel\xe9'

    def __str__(self):      # tries to "look nice"
        return unicode(self).encode(sys.stdout.encoding or DEFAULT_ENCODING,
                                    'replace')

    def __repr__(self):     # must be unambiguous
        return repr(unicode(self))

That is, this approach focuses on __unicode__ as the primary way for the class's instances to format themselves -- but since (in Python 2) print calls __str__ instead, it has that one delegate to __unicode__ with the best it can do in terms of encoding. Not perfect, but then Python 2's print statement is far from perfect anyway;-).
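
To see what the encoding step in __str__ produces in different situations, here is a small sketch. The helper encode_for_stream is hypothetical and takes the stream encoding as a parameter instead of reading sys.stdout.encoding; passing None models a stream that reports no encoding, which in Python 2 is typically the case when output is piped or redirected. It runs on both Python 2 and Python 3.

```python
DEFAULT_ENCODING = "UTF-8"  # same fallback as in the example above

def encode_for_stream(text, stream_encoding):
    """Hypothetical helper: encode text for a stream without ever raising."""
    # stream_encoding stands in for sys.stdout.encoding; it may be None.
    return text.encode(stream_encoding or DEFAULT_ENCODING, "replace")

name = u"Pel\xe9"

to_console = encode_for_stream(name, "cp850")  # é exists in cp850: byte 0x82
to_pipe = encode_for_stream(name, None)        # no encoding: falls back to UTF-8
to_ascii = encode_for_stream(name, "ascii")    # é not representable: becomes "?"
```

The 'replace' error handler is what keeps __str__ from raising: a character the target encoding cannot represent degrades to "?" rather than producing a UnicodeEncodeError.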

__repr__, for its part, must strive to be unambiguous, that is, not to "look nice" at the expense of risking ambiguity (ideally, when feasible, it should return a byte string that, if passed to eval, would make an instance equal to the present one... that's far from always feasible, but the lack of ambiguity is the absolute core of the distinction between __str__ and __repr__, and I strongly recommend respecting that distinction!).
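
As an illustration of the eval ideal mentioned above, here is a minimal sketch using a hypothetical Point class (not part of the original question), where __repr__ returns an expression that rebuilds an equal instance and __str__ is free to "look nice". It runs on both Python 2 and Python 3.

```python
class Point(object):
    """Hypothetical class used only to illustrate __repr__ vs __str__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __repr__(self):
        # Unambiguous: a valid expression that recreates the object.
        return "Point(%r, %r)" % (self.x, self.y)

    def __str__(self):
        # Readable: meant for display, not for round-tripping.
        return "(%s, %s)" % (self.x, self.y)

p = Point(1, 2)
clone = eval(repr(p))  # rebuilds an equal instance from the repr
```

This round trip is only feasible for simple value-like classes; where it is not, the priority remains that __repr__ stays unambiguous.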

Alex Martelli
Thanks Alex, I see now why `print D()` behaves differently from `print D().__str__()`. It was a bit confusing. So, could you share any guidelines for handling unicode strings in the __repr__ or __str__ methods? Should I return a repr() of the whole unicode string or encode it to a string object? Or I could still return a unicode string and set the encoding with sys.setdefaultencoding in a custom site module (but I found this to be too intrusive).
Thorfin
@Thorfin, to return Unicode, implement `__unicode__`. `__str__` should always return a byte string, and `__repr__` a byte string that "ideally" (but that's not always possible or reasonable) one could `eval` to build a new object.
Alex Martelli
I believe `__unicode__` is only called in conjunction with unicode(), and unfortunately that doesn't solve my problems. I have added some info at the end of the body of my initial question. Thanks again.
Thorfin
`__repr__` is _supposed_ to return totally unambiguous output -- it would be an abomination to have it avoid escape sequences in the output (PLEASE don't do that!). Editing the A to show the best way to achieve your specific desired result.
Alex Martelli
Thanks! I now understand the need for __repr__ to return unambiguous output. I was finishing implementing the same behavior when I saw your example, just with `return repr(self.__unicode__())` instead of `return repr(unicode(self))`. I believe/hope it's the same thing.
Thorfin
@Thorfin, yes, if your class does have a `__unicode__` method, then `unicode(self)` ends up calling `self.__unicode__()`.
Alex Martelli