views: 139

answers: 2

Hi,

I have an array containing Japanese characters as well as "normal" ones. How do I align the printout of these?

#!/usr/bin/python
# coding=utf-8

a1=['する', 'します', 'trazan', 'した', 'しました']
a2=['dipsy', 'laa-laa', 'banarne', 'po', 'tinky winky']

for i,j in zip(a1,a2):
    print i.ljust(12),':',j

print '-'*8

for i,j in zip(a1,a2):
    print i,len(i)
    print j,len(j)

Output:

する       : dipsy
します    : laa-laa
trazan       : banarne
した       : po
しました : tinky winky
--------
する 6
dipsy 5
します 9
laa-laa 7
trazan 6
banarne 7
した 6
po 2
しました 12
tinky winky 11

thanks, //Fredrik

+2  A: 

Use unicode objects instead of byte strings:

#!/usr/bin/python
# coding=utf-8

a1=[u'する', u'します', u'trazan', u'した', u'しました']
a2=[u'dipsy', u'laa-laa', u'banarne', u'po', u'tinky winky']

for i,j in zip(a1,a2):
    print i.ljust(12),':',j

print '-'*8

for i,j in zip(a1,a2):
    print i,len(i)
    print j,len(j)

Unicode objects deal with characters directly.
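To illustrate the difference (a Python 3 sketch, where `str` is Unicode by default): `len()` on a Unicode string counts characters, while `len()` on the UTF-8 encoded bytes counts bytes, and each of these Japanese characters takes 3 bytes in UTF-8. That is why the byte-string version reported `する 6`.

```python
# -*- coding: utf-8 -*-
# Character count vs. byte count for a Japanese string.
s = 'する'
print(len(s))                   # characters: 2
print(len(s.encode('utf-8')))   # UTF-8 bytes: 6
```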

jcdyer
This doesn’t solve the problem.
jleedev
using u'string' I get `UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)`. Solved by doing `print j.encode('utf-8')`, but that seems extremely awkward...
Fredrik
@jleedev—My console says otherwise. Can you be more specific? What results are you getting? @Fredrik—Sounds like your terminal wants to use Latin-1 encoding. You'll have to find a way to convince it to use UTF-8, or write your output to a file instead of printing (I recommend `import codecs; f = codecs.open('output.txt', 'w', encoding='utf-8')`). Good luck!
jcdyer
@jleedev—Ah. I see what's going on. It depends on your font, to some extent, and there's nothing python can do about that, but it does fix the issue with the character counts in the second `for` loop.
jcdyer
+2  A: 

Using the `unicodedata.east_asian_width` function, keep track of which characters are narrow and which are wide when computing the display width of the string.

#!/usr/bin/python
# coding=utf-8

import sys
import codecs
import unicodedata

out = codecs.getwriter('utf-8')(sys.stdout)

def width(string):
    # East Asian wide ('W') and fullwidth ('F') characters occupy two
    # columns in a terminal; everything else occupies one.
    return sum(1+(unicodedata.east_asian_width(c) in "WF")
        for c in string)

a1=[u'する', u'します', u'trazan', u'した', u'しました']
a2=[u'dipsy', u'laa-laa', u'banarne', u'po', u'tinky winky']

for i,j in zip(a1,a2):
    out.write('%s %s: %s\n' % (i, ' '*(12-width(i)), j))

Outputs:

する          : dipsy
します        : laa-laa
trazan        : banarne
した          : po
しました      : tinky winky

It doesn’t look right in some web browser fonts, but in a terminal window they line up properly.
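On Python 3 (an assumption; the answer above targets Python 2), the same width-aware padding can be sketched without the `codecs` wrapper, since `print` handles Unicode directly. The 12-column width is arbitrary, as in the original:

```python
import unicodedata

def width(s):
    # East Asian wide ('W') and fullwidth ('F') characters count as two
    # columns; everything else as one.
    return sum(2 if unicodedata.east_asian_width(c) in 'WF' else 1
               for c in s)

def pad(s, columns):
    # Left-justify to a given number of terminal columns.
    return s + ' ' * max(0, columns - width(s))

a1 = ['する', 'します', 'trazan', 'した', 'しました']
a2 = ['dipsy', 'laa-laa', 'banarne', 'po', 'tinky winky']

for i, j in zip(a1, a2):
    print(pad(i, 12), ':', j)
```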

jleedev
a tab is not a solution; what I'm really doing is generating Sphinx tables containing Japanese verb conjugations. I'll check the east_asian_width function...
Fredrik
perfect, just what I was looking for, in theory at least. Trying to run it though gives me this:

$ ./try.py
Traceback (most recent call last):
  File "./try.py", line 12, in <module>
    print i,' '*(12-width(i)),':',j
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)
Fredrik
@Fredrik Ouch, you might need to look at `sys.setdefaultencoding`. http://blog.ianbicking.org/illusive-setdefaultencoding.html
jleedev
extremely annoying, can't get it to work...

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'

can you pls post the complete code?
Fredrik
Ok, I think the correct solution is not to use the default encoding, but to explicitly encode every unicode string into the codec you want. See this question (http://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python). OS X appears to have customized this problem away...
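For example (a Python 3 sketch; in Python 2 the equivalent is `u'...'.encode('utf-8')`), explicitly encoding before writing bypasses the default-encoding guesswork entirely:

```python
import sys

# Encode explicitly rather than relying on the interpreter's or
# terminal's default encoding.
s = 'しました'
encoded = s.encode('utf-8')          # bytes, independent of any default
sys.stdout.buffer.write(encoded + b'\n')
```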
jleedev
thanks, it's working for me now; sorry about my confusion ;-)
Fredrik