Lately, I've had lots of trouble with __repr__(), format(), and encodings. Should the output of __repr__() be encoded or be a unicode string? Is there a best encoding for the result of __repr__() in Python? What I want to output does have non-ASCII characters.

I use Python 2.x, and want to write code that can easily be adapted to Python 3. The program thus uses

# -*- coding: utf-8 -*-
from __future__ import unicode_literals, print_function  # The 'Hello' literal represents a Unicode object

Here are some problems that have been bothering me, and I'm looking for a solution that solves them:

  1. Printing to a UTF-8 terminal should work (I have sys.stdout.encoding set to UTF-8, but it would be best if other cases worked too).
  2. Piping the output to a file (encoded in UTF-8) should work (in this case, sys.stdout.encoding is None).
  3. My code for many __repr__() functions currently has many return ….encode('utf-8'), and that's heavy. Is there anything robust and lighter?
  4. In some cases, I even have ugly beasts like return ('<{}>'.format(repr(x).decode('utf-8'))).encode('utf-8'), i.e., the representation of objects is decoded, put into a formatting string, and then re-encoded. I would like to avoid such convoluted transformations.

What would you recommend to do in order to write simple __repr__() functions that behave nicely with respect to these encoding questions?

+5  A: 

In Python2, __repr__ (and __str__) must return a string object, not a unicode object. In Python3, the situation is reversed, __repr__ and __str__ must return unicode objects, not byte (née string) objects:

class Foo(object):
    def __repr__(self):
        return u'\N{WHITE SMILING FACE}' 

class Bar(object):
    def __repr__(self):
        return u'\N{WHITE SMILING FACE}'.encode('utf8')

print(repr(Bar()))
# ☺  (on a UTF-8 terminal; the value itself is the byte string '\xe2\x98\xba')
repr(Foo())
# UnicodeEncodeError: 'ascii' codec can't encode character u'\u263a' in position 0: ordinal not in range(128)

In Python2, you don't really have a choice. You have to pick an encoding for the return value of __repr__.
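
One way to keep that choice in a single place, rather than sprinkling .encode('utf-8') over every __repr__, is a small class decorator. The sketch below is only an illustration: the names utf8_repr and Smiley are hypothetical, and UTF-8 is an assumed encoding. On Python 3 the decorator does nothing, since __repr__ must return text there.

```python
# -*- coding: utf-8 -*-
import sys

PY2 = sys.version_info[0] == 2


def utf8_repr(cls):
    """Hypothetical helper: lets __repr__ be written once, returning text.

    On Python 2 the text is encoded to UTF-8 (an assumed choice) so that
    repr() receives a byte string. On Python 3 the class is left untouched.
    """
    if PY2:
        text_repr = cls.__repr__

        def encoded_repr(self):
            return text_repr(self).encode('utf-8')

        cls.__repr__ = encoded_repr
    return cls


@utf8_repr
class Smiley(object):
    def __repr__(self):
        # Always written as text; the decorator handles Python 2.
        return u'<Smiley \u263a>'
```

With this, the class body is already in its Python 3 form, and the decorator is the only piece to delete after the transition.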

By the way, have you read the PrintFails wiki? It may not directly answer your other questions, but I did find it helpful in illuminating why certain errors occur.
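
For the terminal and pipe cases in the question (sys.stdout.encoding set to UTF-8, or None when the output is piped), a minimal sketch is to pick the stream's encoding when it has one and fall back to UTF-8 otherwise. The helper name encode_for and the UTF-8 fallback are assumptions made for illustration, not something from the wiki.

```python
def encode_for(stream, text, fallback='utf-8'):
    """Hypothetical helper: encode text for a stream.

    Uses stream.encoding when it is set, otherwise the assumed fallback.
    Characters the target encoding cannot represent become '?'.
    """
    encoding = getattr(stream, 'encoding', None) or fallback
    return text.encode(encoding, 'replace')


class FakePipe(object):
    """Stands in for a piped stdout whose encoding is None."""
    encoding = None


class FakeLatin1Terminal(object):
    """Stands in for a terminal that reports a non-UTF-8 encoding."""
    encoding = 'latin-1'


piped = encode_for(FakePipe(), u'\u263a')
narrow = encode_for(FakeLatin1Terminal(), u'\u263a')
```

In Python 2 the result could be passed to sys.stdout.write(); in Python 3, print() normally performs this encoding itself.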


When using from __future__ import unicode_literals,

('<{}>'.format(repr(x).decode('utf-8'))).encode('utf-8')

can be more simply written as

str('<{}>').format(repr(x))

assuming str encodes to utf-8 on your system.

Without from __future__ import unicode_literals, the expression can be written as:

'<{}>'.format(repr(x))
unutbu
It would be nice if the documentation mentioned this :) (http://docs.python.org/reference/datamodel.html#basic-customization does not)… Anyway… you would say that the approach in point 4 in the question is cumbersome but necessary, right?
EOL
EOL: Assuming Python2, `repr(x)` must return an encoded string. If it was encoded with utf-8, then `repr(x).decode('utf8').encode('utf8')` should not be necessary. If `repr(x)` is encoded with some other encoding, `repr(x).decode('utf8')` will either fail (with UnicodeDecodeError) or produce bogus results, or maybe decode correctly by lucky happenstance. So, AFAIK, `repr(x).decode('utf8').encode('utf8')` should never be necessary. Can you provide an example?
unutbu
@EOL, **The return value must be a string object.** is how the reference manual page you point to expresses the constraint that the return value must be an instance of `str` (a unicode object would not be "a string object"). `repr` is _normally_ expected to return ascii only (think of `repr(uo)` for all unicode objects, for example: even _that_ returns ascii only -- I think no built-in or standard library type behaves otherwise) but strictly speaking that is not a language constraint, so it's not the reference manual's business. Proposed docs patches are always welcome, btw!-)
Alex Martelli
@Alex: Thank you for the comments. I guess that my confusion comes from the fact that one also says "Unicode string", in Python 2.x: that's why I was wondering whether `__repr__()` could also return a *Unicode* string… I have been thinking of submitting doc patches. :)
EOL
@~unutbu: I should have put parentheses in the example, which differs from what you put in the comment: the decoded object is put *into a formatting string* before encoding. I updated the original question.
EOL
@EOL, yes, I find string-related terminology ("string", "unicode string", "raw string", ...) unfortunately at risk of ambiguity in common discourse -- I _try_ to always use rigorously non-ambiguous terms such as "str instance", "unicode object", "rawstring _literal_ ", and so forth, but sometimes such rigorous terminology feels stilted in non-formal contexts. In the Language Reference, the only occurrences of the unfortunate "unicode string" are in a single paragraph in 2.4.1 (literals): s/string/object/ there and "string" becomes unambiguous *in the Language Reference* (where it matters).
Alex Martelli
It's also possible that the Language Reference is _deliberately_ ambiguous because it's **not** supposed to be a Reference for **CPython** only, but for _all_ conforming Python implementations: in Jython and IronPython, which we're very keen to consider fully conforming implementations, **all** strings are Unicode (and it would be costly and totally against their respective platforms to make things otherwise). Maybe we do need a supplemental **CPython** implementation-specific reference, as an _addition_ to the implementation-neutral **Language** one.
Alex Martelli
@~unutbu: since `from __future__ import unicode_literals` is in force, '<{}>' *is* a Unicode string. So, it looks again like you're confirming that what I'm doing is correct; it's good to get such a confirmation. I'll mark your answer as accepted if you can remove the part that assumes that '<{}>' is a str.
EOL
@EOL: Ah, I forgot about `unicode_literals`. Yes, I agree with you then. If you didn't have `unicode_literals` turned on, however, you could write `'<{}>'.format(repr(x))` instead of `('<{}>'.format(repr(x).decode('utf-8'))).encode('utf-8')`. Are you sure that `from __future__ import unicode_literals` is worth it?
unutbu
Of course, `str('<{}>').format(repr(x))` would also work. See http://stackoverflow.com/questions/809796/any-gotchas-using-unicode-literals-in-python-2-6
unutbu
@~unutbu: Unicode with Python 2.x *is* tricky: `'<{}>'.format(repr(x))` does *not* work when you have bytes with value > 127 in the representation (because the literal creates a Unicode object)! Thank you for the `str(…).format()` suggestion. As for the `from __future__`, I like the fact that string literals are Unicode objects, because these objects correspond to Python 3's strings (one of the goals is to prepare the transition to Python 3).
EOL
@EOL: I'm not sure that `from __future__ import unicode_literals` is helping you prepare for Python3. Think about what your code should look like in Python3. It would just be `'<{}>'.format(repr(x))`. Anything you write that deviates from that, even `str('<{}>').format(repr(x))`, is just cruft that will have to be fixed during the transition. Are you sure that `'<{}>'.format(repr(x))` does not work if you turn off `unicode_literals`?
unutbu
@~unutbu: good point, about the simpler code when not using `unicode_literals`. I'll turn it off (in which case the simpler code does indeed work). If you can remove the part with "may be incorrect" (which refers to a different situation than that of the question, which assumed Unicode literals), I'll mark your answer as accepted.
EOL
@EOL: agreed. Best of luck with your work.
unutbu