ansaurus

Question

How to print tuples of unicode strings in original language (not u'foo' form)

Answer 1

A:

Try:

import codecs, sys
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
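
To see what that wrapper does without touching the real sys.stdout, here is a minimal sketch that wraps an in-memory byte buffer instead. The buffer and the names buf, writer and encoded are illustrative assumptions, not part of the snippet above; it should run on both Python 2 and Python 3.

```python
import codecs
import io

# Stand-in for a byte-oriented stdout (hypothetical, for illustration only).
buf = io.BytesIO()

# codecs.getwriter('utf8') returns a StreamWriter class; calling it wraps the stream.
writer = codecs.getwriter('utf8')(buf)

# Unicode text written to the wrapper is encoded to UTF-8 on the way out.
writer.write(u'\u4e80')

# The underlying stream now holds the three UTF-8 bytes for that character.
encoded = buf.getvalue()
```

If the terminal expects a different encoding, those bytes will show up as garbage, which is one way the original snippet can appear to fail.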

MarkusQ 2009-03-07 04:53:25

Are you using MS Windows by any chance? If so, you should tag the question accordingly, because it may change which answers will work for you.

MarkusQ 2009-03-07 06:01:16

Answer 2

A:

Python 2 treats source files as ASCII by default, so you must use \u escape sequences unless you declare an encoding. See PEP 263.

#!/usr/bin/python
# coding=utf-8
t = [u'亀', u'犬']
print t

When you pass a list to print, Python converts it to a string, and for a list that means calling repr() on each element. The output of repr() is designed to be passed back to eval(), which is why you see those \u sequences. Here's a hack to get around that, based on bobince's solution. The console must accept Unicode or this will raise an exception.

t = [(u'亀',), (u'犬',)]
print repr(t).decode('raw_unicode_escape')
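
A rough Python 3 analogue of this hack, offered as a sketch rather than as the answer's own code: ascii() produces the escaped form that Python 2's repr() gives for unicode strings, and decoding that with raw_unicode_escape turns the \u sequences back into characters. The helper name show_unescaped is hypothetical.

```python
def show_unescaped(obj):
    """Return the ASCII-escaped representation of obj with \\u escapes decoded."""
    escaped = ascii(obj)  # backslash-escaped, pure-ASCII text
    # raw_unicode_escape only interprets \uXXXX and \UXXXXXXXX sequences,
    # so quotes and other backslash escapes are left as they are.
    return escaped.encode('ascii').decode('raw_unicode_escape')

t = [('亀',), ('犬',)]
escaped_form = ascii(t)
readable_form = show_unescaped(t)
```

On Python 3 this round trip is mostly of academic interest, since repr() there already shows such characters directly.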

Matthew 2009-03-07 04:55:28

When I do that, I now get: [(u'\u4e80',), (u'\u72ac',)] which is different, but not what I want.

Daniel H 2009-03-07 05:14:05

Answer 3

A:

So this appears to do what I want:

print '[%s]' % ', '.join([', '.join('(%s,)' % ', '.join(ti) for ti in t)])


>>> t = [('亀',), ('犬',)]
>>> print t
[('\xe4\xba\x80',), ('\xe7\x8a\xac',)]
>>> print '[%s]' % ', '.join([', '.join('(%s,)' % ', '.join(ti) for ti in t)])
[(亀,), (犬,)]

Surely there's a better way to do it.

(The other two answers so far don't print the original string as desired.)

Daniel H 2009-03-07 05:12:18

How does the "sys.stdout = codecs.getwriter('utf8')(sys.stdout)" example fail? Perhaps you are on a terminal which uses something other than utf8?

Andrew Dalke 2009-03-07 12:30:11

I've revised my answer to address output too.

Matthew 2009-03-08 00:42:45

Actually, I just noticed bobince's answer. It's possible to massively simplify that!

Matthew 2009-03-08 01:03:30

Answer 4

+3 A:

First, there's a slight misunderstanding in your post. If you define a list like this:

>>> t = [('亀',), ('犬',)]

...you are defining str objects, not unicode objects. If you want unicode objects, you have to add a u prefix before the opening quote:

>>> t = [(u'亀',), (u'犬',)]

But let's assume you actually want str objects, not unicode. The main problem is that the __str__ method of a list (or a tuple) is practically the same as its __repr__ method (which returns a string that, when evaluated, would recreate the same object). Because __repr__ should be encoding-independent, strings are represented in the safest way possible, i.e. each byte outside the ASCII range is shown as a hex escape (\xe4, for example).
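
A small sketch of that behaviour, using a made-up class (Label is a hypothetical name, not something from the question) whose __str__ and __repr__ differ, so you can see which one a container picks. It should behave the same on Python 2 and Python 3.

```python
class Label(object):
    """Toy class whose str() and repr() are deliberately different."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return 'str:' + self.text

    def __repr__(self):
        return 'repr:' + self.text

item = Label('x')

direct = str(item)         # uses Label.__str__
in_list = str([item])      # the list formats its elements with repr()
in_tuple = str((item,))    # so does the tuple
```

The same thing happens to the strings in the question: printing the list shows each element's repr(), not its str().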

Unfortunately, as far as I know, there's no locale-aware library function for printing a list. You could use an almost-general-purpose function like this:

def collection_str(collection):
    if isinstance(collection, list):
        brackets = '[%s]'
        single_add = ''
    elif isinstance(collection, tuple):
        brackets = '(%s)'
        single_add = ','
    else:
        return str(collection)
    items = ', '.join([collection_str(x) for x in collection])
    if len(collection) == 1:
        items += single_add
    return brackets % items

>>> print collection_str(t)
[('亀',), ('犬',)]

Note that this won't work for all possible collections (sets and dictionaries, for example), but it's easy to extend it to handle those.
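
One possible way to extend it, as a sketch only: the dict and set branches below are assumptions about how you might want those printed (the 'set([...])' form simply mimics Python 2's own repr), and the leaves still go through str(), so strings come out without quotes.

```python
def collection_str(collection):
    """Like str(), but recurses into containers and uses str() on the leaves."""
    if isinstance(collection, list):
        template = '[%s]'
        parts = [collection_str(x) for x in collection]
    elif isinstance(collection, tuple):
        template = '(%s)'
        parts = [collection_str(x) for x in collection]
        if len(parts) == 1:
            parts[0] += ','  # keep the trailing comma of a 1-tuple
    elif isinstance(collection, dict):
        template = '{%s}'
        parts = ['%s: %s' % (collection_str(k), collection_str(v))
                 for k, v in collection.items()]
    elif isinstance(collection, (set, frozenset)):
        template = 'set([%s])'  # Python 2 style; an arbitrary choice here
        parts = [collection_str(x) for x in collection]
    else:
        return str(collection)
    return template % ', '.join(parts)

nested = collection_str([('a',), ('b',)])
mapping = collection_str({'k': [1, (2,)]})
single_set = collection_str(set(['x']))
```

Element order for dicts and sets follows whatever order iteration gives, so multi-element output is not guaranteed to be stable.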

DzinX 2009-03-07 12:45:51

Also, “return str(collection)” wouldn't include the ' quotes (or escape characters like \ in the string); you get [(亀,), (犬,)].

bobince 2009-03-07 23:30:43

You're right, thanks. Hmm... I can see no nice way to fix that :)

DzinX 2009-03-07 23:47:40

Answer 5

+5 A:

but what I want to see printed out is, surprise:

[('亀',), ('犬',)]

What do you want to see it printed out on? Because if it's the console, it's not at all guaranteed your console can display those characters. This is why Python's ‘repr()’ representation of objects goes for the safe option of \-escapes, which you will always be able to see on-screen and type in easily.

As a prerequisite you should be using Unicode strings (u''). And, as mentioned by Matthew, if you want to be able to write u'亀' directly in source you need to make sure Python can read the file's encoding. For occasional use of non-ASCII characters it is best to stick with the escaped version u'\u4e80', but when you have a lot of East Asian text you want to be able to read, “# coding=utf-8” is definitely the way to go.
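
A minimal sketch of the two spellings side by side; the variable names are illustrative, and the snippet assumes the file is saved as UTF-8 (it runs on Python 2 with the declaration in place, and on Python 3.3 or later).

```python
# -*- coding: utf-8 -*-
# The declaration above only takes effect on the first or second line of a file.

escaped = u'\u4e80'  # pure-ASCII source, works without any declaration
literal = u'亀'       # needs the declared encoding to match how the file is saved

# Both spellings produce the same one-character Unicode string.
same = (escaped == literal)
code_point = ord(literal)
```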

print '[%s]' % ', '.join([', '.join('(%s,)' % ', '.join(ti) for ti in t)])

That would print the characters without the surrounding quotes. Really you'd want:

def reprunicode(u):
    return repr(u).decode('raw_unicode_escape')

print u'[%s]' % u', '.join([u'(%s,)' % reprunicode(ti[0]) for ti in t])

This works, but if the console doesn't support Unicode (and this is especially troublesome on Windows), you'll get a big old UnicodeError.

In any case, this rarely matters because the repr() of an object, which is what you're seeing here, doesn't usually make it to the public user interface of an application; it's really for the coder only.

However, you'll be pleased to know that Python 3.0 behaves exactly as you want:

- plain '' strings without the ‘u’ prefix are now Unicode strings
- repr() shows most Unicode characters verbatim
- Unicode in the Windows console is better supported (you can still get UnicodeError on Unix if your environment isn't UTF-8)
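
A quick sketch of the first two points, meant to be run on Python 3 (the variable names are only for illustration):

```python
t = [('亀',), ('犬',)]

# Plain quoted literals are Unicode text: the type is str and each holds one character.
text_type = type(t[0][0]).__name__
length = len(t[0][0])

# repr() of the list keeps the characters as they are instead of escaping them.
shown = repr(t)
```

Whether print(t) then displays correctly still depends on the encoding of the console, as noted in the last point.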

Python 3.0 is a little bit new and not so well-supported by libraries, but it might well suit your needs better.

bobince 2009-03-07 12:49:29

Thanks for pointing out the 'raw_unicode_escape' encoding. I had no idea that existed!

Matthew 2009-03-08 01:06:48
