ansaurus

Question

Printing objects and unicode: what's under the hood? What are good guidelines?

Answer 1

A:

I presume your sys.getdefaultencoding() is still 'ascii', and I think that encoding is used whenever str() or repr() is applied to an object. You could change it with sys.setdefaultencoding(). As soon as you write to a stream, though, be it STDOUT or a file, you have to comply with that stream's encoding. This would also apply to piping in the shell, IMO. I assume that 'print' honors the STDOUT encoding, but the exception happens before 'print' is invoked, while its argument is being constructed.
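To check which encodings are actually in play, something like the following sketch may help. The helper name `describe_encodings` is made up for illustration; `sys.getdefaultencoding()` and the `encoding` attribute of `sys.stdout` are the only real APIs used, and the latter can be None (for example when output is piped).

```python
import sys

def describe_encodings():
    """Return the default codec and the encoding stdout reports (may be None)."""
    return {
        'default': sys.getdefaultencoding(),
        # Not every stream object has an .encoding attribute, hence getattr.
        'stdout': getattr(sys.stdout, 'encoding', None),
    }

print(describe_encodings())
```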

ThomasH 2010-08-24 14:03:38

Answer 2

+2 A:

Python doesn't have many semantic type constraints on given functions and methods, but it has a few, and here's one of them: __str__ (in Python 2.*) must return a byte string. As usual, if a unicode object is found where a byte string is required, the current default encoding (usually 'ascii') is applied in the attempt to make the required byte string from the unicode object in question.
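As a rough illustration of that implicit step, the sketch below performs the same conversion explicitly. The helper `encode_with_default` is hypothetical and only stands in for what Python 2 does behind the scenes; it assumes the default encoding is 'ascii'.

```python
def encode_with_default(text, default_encoding='ascii'):
    """Explicit stand-in for the implicit unicode -> byte string conversion."""
    # Strict error handling: any character outside the codec raises.
    return text.encode(default_encoding)

plain = encode_with_default(u'Pele')      # pure ASCII, so this succeeds

try:
    encode_with_default(u'Pel\xe9')       # e-acute is outside ASCII
    error = None
except UnicodeEncodeError as exc:
    error = exc                           # the same kind of failure str(x) reports
```

The second call fails for the same reason a `__str__` that returns such a unicode object fails: the conversion happens with the default codec, before any output stream is involved.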

For this operation, the encoding (if any) of any given file object is irrelevant, because what's being returned from __str__ may be about to be printed, or may be going to be subject to completely different and unrelated treatment. Your purpose in calling __str__ does not matter to the call itself and its results; Python, in general, doesn't take into account the "future context" of an operation (what you are going to do with the result after the operation is done) in determining the operation's semantics.

That's because Python doesn't always know your future intentions, and it tries to minimize the amount of surprise. print str(x) and s = str(x); print s (the same operations performed in one gulp vs two), in particular, must have the same effects; in the second case, there will be an exception if str(x) cannot validly produce a byte string (for example, because x.__str__() can't), and therefore the exception should also occur in the first case.

print itself (since 2.4, I believe), when presented with a unicode object, takes into consideration the .encoding attribute (if any) of the target stream (by default sys.stdout); other operations, as yet unconnected to any given target stream, don't -- and str(x) (i.e. x.__str__()) is just such an operation.
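A minimal sketch of that stream-aware behavior, under stated assumptions: `encode_for_stream` and `FakeStream` are hypothetical names, and this is only an approximation of what print does with a unicode object, not its actual implementation. It looks at the target stream's `.encoding` and falls back to the default encoding when the stream reports none.

```python
import sys

def encode_for_stream(text, stream=None, errors='strict'):
    """Encode text using the target stream's encoding (hypothetical helper)."""
    if stream is None:
        stream = sys.stdout
    # Fall back to the default codec when the stream reports no encoding,
    # for example when output is piped.
    encoding = getattr(stream, 'encoding', None) or sys.getdefaultencoding()
    return text.encode(encoding, errors)

class FakeStream(object):
    """Stand-in for a file object; only the .encoding attribute matters here."""
    def __init__(self, encoding):
        self.encoding = encoding

utf8_bytes = encode_for_stream(u'Pel\xe9', FakeStream('UTF-8'))
latin1_bytes = encode_for_stream(u'Pel\xe9', FakeStream('latin-1'))
```

The same unicode object yields different bytes depending on the stream, which is exactly the information `str(x)` does not have when it runs.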

Hope this helped show the reason for the behavior that is annoying you...

Edit: the OP now clarifies "My main issue is to make a class "printable", i.e. print A() prints something fully readable (not with the \x*** unicode characters).". Here's the approach I think works best for that specific goal:

import sys

DEFAULT_ENCODING = 'UTF-8'  # or whatever you like best

class sic(object):

    def __unicode__(self):  # the "real thing"
        return u'Pel\xe9'

    def __str__(self):      # tries to "look nice"
        return unicode(self).encode(sys.stdout.encoding or DEFAULT_ENCODING,
                                    'replace')

    def __repr__(self):     # must be unambiguous
        return repr(unicode(self))

That is, this approach focuses on __unicode__ as the primary way for the class's instances to format themselves -- but since (in Python 2) print calls __str__ instead, it has that one delegate to __unicode__ with the best it can do in terms of encoding. Not perfect, but then Python 2's print statement is far from perfect anyway;-).

__repr__, for its part, must strive to be unambiguous, that is, not to "look nice" at the expense of risking ambiguity (ideally, when feasible, it should return a byte string that, if passed to eval, would make an instance equal to the present one... that's far from always feasible, but the lack of ambiguity is the absolute core of the distinction between __str__ and __repr__, and I strongly recommend respecting that distinction!).
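The eval round trip mentioned above can be sketched as follows. The class `Name` is invented for this example and is not part of the original code; it assumes the object's whole state is one text attribute, which is what makes the round trip feasible here.

```python
class Name(object):
    """Hypothetical class whose repr can be passed to eval."""

    def __init__(self, text):
        self.text = text

    def __repr__(self):
        # %r embeds the repr of the attribute, keeping quoting and escapes
        # intact, so the result is a valid constructor call.
        return 'Name(%r)' % (self.text,)

    def __eq__(self, other):
        return isinstance(other, Name) and self.text == other.text

original = Name(u'Pel\xe9')
rebuilt = eval(repr(original))   # rebuild an equal instance from the repr
```

Here `rebuilt == original` holds, which is the property to aim for whenever it is feasible.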

Alex Martelli 2010-08-24 14:21:11

Thanks Alex, I see now why `print D()` behaves differently from `print D().__str__()`. It was a bit confusing. So, could you share any guidelines for handling unicode strings in the __repr__ or __str__ methods? Should I return a repr() of the whole unicode or encode it to a string object? Or I could still return a unicode and set the encoding with sys.setdefaultencoding in a custom site module (but I found this to be too intrusive).

Thorfin 2010-08-24 15:09:25

@Thorfin, to return Unicode, implement `__unicode__`. `__str__` should always return a byte string, and `__repr__` a byte string that "ideally" (but that's not always possible or reasonable) one could `eval` to build a new object.

Alex Martelli 2010-08-24 15:15:41

I believe `__unicode__` is only called in conjunction with unicode(), and unfortunately that doesn't solve my problems. I have added some info at the end of the body of my initial question. Thanks again.

Thorfin 2010-08-24 17:59:34

`__repr__` is _supposed_ to return totally unambiguous output -- it would be an abomination to have it avoid escape sequences in the output (PLEASE don't do that!). Editing the A to show the best way to achieve your specific desired result.

Alex Martelli 2010-08-24 19:53:50

Thanks! I now understand the need for __repr__ to return unambiguous output. I was finishing an implementation of the same behavior when I saw your example, only with `return repr(self.__unicode__())` instead of `return repr(unicode(self))`. I believe/hope it's the same thing.

Thorfin 2010-08-25 17:40:30

@Thorfin, yes, if your class does have a `__unicode__` method, then `unicode(self)` ends up calling `self.__unicode__()`.

Alex Martelli 2010-08-25 21:02:18
