I have a list of tuples of unicode objects:

>>> t = [('亀',), ('犬',)]

Printing this out, I get:

>>> print t
[('\xe4\xba\x80',), ('\xe7\x8a\xac',)]

which I guess is a list of the UTF-8 byte representations of those strings?

but what I want to see printed out is, surprise:

[('亀',), ('犬',)]

but I'm having an inordinate amount of trouble getting those bytes back into a human-readable form.

A: 

Try:

import codecs, sys
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
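
To see what that wrapper does, here is a minimal sketch, written for Python 3 rather than the Python 2 used in this thread. It uses an in-memory io.BytesIO as a stand-in for a byte-oriented stdout (that substitution is an assumption made purely for illustration), so the bytes the writer produces can be inspected.

```python
import codecs
import io

# Stand-in for a byte-oriented stdout, so the effect can be inspected.
buffer = io.BytesIO()
writer = codecs.getwriter('utf8')(buffer)

writer.write(u'亀')           # text goes in ...
encoded = buffer.getvalue()  # ... UTF-8 bytes come out
print(repr(encoded))
```

The wrapper only encodes text on its way out; whether the characters then display correctly still depends on the terminal interpreting those bytes as UTF-8.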
MarkusQ
Are you using MS windows by any chance? If so, you should tag it because it may change the range of answers that will work for you.
MarkusQ
A: 

Python 2 assumes source files are ASCII unless you declare an encoding, so you must either use \u escape sequences or specify the encoding at the top of the file. See PEP 263.

#!/usr/bin/python
# coding=utf-8
t = [u'亀', u'犬']
print t

When you pass a list to print, Python converts the object into a string using its rules for string conversions. The output of such conversions is designed for eval(), which is why you see those \u sequences. Here's a hack to get around that, based on bobince's solution. The console must accept Unicode or this will throw an exception.

t = [(u'亀',), (u'犬',)]
print repr(t).decode('raw_unicode_escape')
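
As a rough illustration of what the 'raw_unicode_escape' step does, the sketch below (Python 3 syntax, unlike the Python 2 snippet above) starts from a hand-written byte string that mimics what Python 2's repr(t) returns; the variable names are made up for the example.

```python
# Bytes that mimic Python 2's repr() of a list of unicode tuples.
escaped = b"[(u'\\u4e80',), (u'\\u72ac',)]"

# 'raw_unicode_escape' turns each \uXXXX sequence back into its character
# and leaves everything else (brackets, quotes, the u prefix) untouched.
readable = escaped.decode('raw_unicode_escape')

print(readable == u"[(u'亀',), (u'犬',)]")
```

The u prefixes survive because they are part of the repr text, not something the codec interprets.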
Matthew
When I do that, I now get: [(u'\u4e80',), (u'\u72ac',)] which is different, but not what I want.
Daniel H
A: 

So this appears to do what I want:

print '[%s]' % ', '.join([', '.join('(%s,)' % ', '.join(ti) for ti in t)])


>>> t = [('亀',), ('犬',)]
>>> print t
[('\xe4\xba\x80',), ('\xe7\x8a\xac',)]
>>> print '[%s]' % ', '.join([', '.join('(%s,)' % ', '.join(ti) for ti in t)])
[(亀,), (犬,)]

Surely there's a better way to do it.

(The other two answers so far don't print the original strings as desired.)

Daniel H
How does the "sys.stdout = codecs.getwriter('utf8')(sys.stdout)" example fail? Perhaps you are on a terminal which uses something other than utf8?
Andrew Dalke
I've revised my answer to address output too.
Matthew
Actually, I just noticed bobince's answer. It's possible to massively simplify that!
Matthew
+3  A: 

First, there's a slight misunderstanding in your post. If you define a list like this:

>>> t = [('亀',), ('犬',)]

...those are not unicodes you define, but strs. If you want to have unicode types, you have to add a u before the character:

>>> t = [(u'亀',), (u'犬',)]

But let's assume you actually want strs, not unicodes. The main problem is that the __str__ method of a list (or a tuple) is practically the same as its __repr__ method (which returns a string that, when evaluated, would create exactly the same object). Because the __repr__ method should be encoding-independent, strings are represented in the safest way possible, i.e. each byte outside the ASCII range is shown as a hex escape (\xe4, for example).
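
A small sketch of that behaviour, written for Python 3, where the closest analogue of a Python 2 str is a bytes object (so the output carries a b prefix that Python 2 would not show):

```python
# The three UTF-8 bytes that a Python 2 str literal '亀' would hold.
raw = u'亀'.encode('utf-8')
items = [(raw,)]

# A list has no separate "pretty" form: str() falls back to repr(),
# and repr() of the container uses repr() of every element.
print(str(items) == repr(items))
print(str(items))
```

The second line of output shows the hex escapes for the three bytes rather than the character itself.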

Unfortunately, as far as I know, there's no library method for printing a list that is locale-aware. You could use an almost-general-purpose function like this:

def collection_str(collection):
    if isinstance(collection, list):
        brackets = '[%s]'
        single_add = ''
    elif isinstance(collection, tuple):
        brackets = '(%s)'
        single_add = ','
    else:
        return str(collection)
    items = ', '.join([collection_str(x) for x in collection])
    if len(collection) == 1:
        items += single_add
    return brackets % items

>>> print collection_str(t)
[(亀,), (犬,)]

Note that this won't work for all possible collections (sets and dictionaries, for example), but it's easy to extend it to handle those.
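
Because the function falls through to str() for the strings themselves, the quotes around each item are dropped. One possible variant, sketched here for Python 3, quotes strings by hand; the name collection_str_quoted and the minimal escaping rules are assumptions for illustration, not a full reimplementation of repr().

```python
def collection_str_quoted(collection):
    """Format nested lists/tuples, keeping non-ASCII text readable."""
    if isinstance(collection, list):
        template, trailing = '[%s]', ''
    elif isinstance(collection, tuple):
        template, trailing = '(%s)', ','
    elif isinstance(collection, str):
        # Quote by hand; escape only backslashes and single quotes.
        escaped = collection.replace('\\', '\\\\').replace("'", "\\'")
        return "'%s'" % escaped
    else:
        # Numbers, None, etc.: their ordinary repr is already fine.
        return repr(collection)
    items = ', '.join(collection_str_quoted(x) for x in collection)
    if len(collection) == 1:
        items += trailing  # a one-element tuple needs its trailing comma
    return template % items


t = [('亀',), ('犬',)]
result = collection_str_quoted(t)  # the text [('亀',), ('犬',)], quotes kept
```

Control characters such as newlines are written out verbatim by this sketch, so it is only suitable for display, not for round-tripping through eval().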

DzinX
Also, “return str(collection)” wouldn't include the ' quotes (or escape characters like \ in the string); you get [(亀,), (犬,)].
bobince
You're right, thanks. Hmm... I can see no nice way to fix that :)
DzinX
+5  A: 

but what I want to see printed out is, surprise:

[('亀',), ('犬',)]

What do you want to see it printed out on? Because if it's the console, it's not at all guaranteed your console can display those characters. This is why Python's ‘repr()’ representation of objects goes for the safe option of \-escapes, which you will always be able to see on-screen and type in easily.

As a prerequisite you should be using Unicode strings (u''). And, as mentioned by Matthew, if you want to be able to write u'亀' directly in source you need to make sure Python can read the file's encoding. For occasional use of non-ASCII characters it is best to stick with the escaped version u'\u4e80', but when you have a lot of East Asian text you want to be able to read, “# coding=utf-8” is definitely the way to go.

print '[%s]' % ', '.join([', '.join('(%s,)' % ', '.join(ti) for ti in t)])

That would print the characters without the surrounding quotes. Really you'd want:

def reprunicode(u):
    return repr(u).decode('raw_unicode_escape')

print u'[%s]' % u', '.join([u'(%s,)' % reprunicode(ti[0]) for ti in t])

This works, but if the console doesn't support Unicode (and this is especially troublesome on Windows), you'll get a big old UnicodeError.

In any case, this rarely matters because the repr() of an object, which is what you're seeing here, doesn't usually make it to the public user interface of an application; it's really for the coder only.

However, you'll be pleased to know that Python 3.0 behaves exactly as you want:

  • plain '' strings without the ‘u’ prefix are now Unicode strings
  • repr() shows most Unicode characters verbatim
  • Unicode in the Windows console is better supported (you can still get UnicodeError on Unix if your environment isn't UTF-8)

Python 3.0 is a little bit new and not so well-supported by libraries, but it might well suit your needs better.
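
A short sketch of the Python 3 behaviour described above. The built-in ascii() is included to show how to get the escaped form back when it is wanted; only the escaped form is printed here so the example does not depend on the console's encoding.

```python
t = [('亀',), ('犬',)]

verbatim = repr(t)  # what print(t) writes: the characters themselves
escaped = ascii(t)  # the escaped form, similar to Python 2's repr()

print(escaped)
```

Printing verbatim directly also works, provided the output encoding can represent the characters.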

bobince
Thanks for pointing out the 'raw_unicode_escape' encoding. I had no idea that existed!
Matthew