views: 139

answers: 2

Hi,

I have an array containing Japanese characters as well as "normal" ones. How do I align the printout of these?

#!/usr/bin/python
# coding=utf-8

a1=['する', 'します', 'trazan', 'した', 'しました']
a2=['dipsy', 'laa-laa', 'banarne', 'po', 'tinky winky']

for i,j in zip(a1,a2):
    print i.ljust(12),':',j

print '-'*8

for i,j in zip(a1,a2):
    print i,len(i)
    print j,len(j)

Output:

する       : dipsy
します    : laa-laa
trazan       : banarne
した       : po
しました : tinky winky
--------
する 6
dipsy 5
します 9
laa-laa 7
trazan 6
banarne 7
した 6
po 2
しました 12
tinky winky 11

thanks, //Fredrik

+2  A: 

Use unicode objects instead of byte strings:

#!/usr/bin/python
# coding=utf-8

a1=[u'する', u'します', u'trazan', u'した', u'しました']
a2=[u'dipsy', u'laa-laa', u'banarne', u'po', u'tinky winky']

for i,j in zip(a1,a2):
    print i.ljust(12),':',j

print '-'*8

for i,j in zip(a1,a2):
    print i,len(i)
    print j,len(j)

Unicode objects deal with characters directly.
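To illustrate the difference (a Python 3 sketch, where `str` is Unicode by default): `len()` on a Unicode string counts characters, while `len()` on the UTF-8 encoded bytes counts bytes, and each of these Japanese characters takes 3 bytes in UTF-8. That is why the byte-string version reported `する 6`.

```python
# -*- coding: utf-8 -*-
# Character count vs. byte count for a Japanese string.
s = 'する'
print(len(s))                   # characters: 2
print(len(s.encode('utf-8')))   # UTF-8 bytes: 6
```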

jcdyer
This doesn’t solve the problem.
jleedev
using u'string' I get `UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)`. Solved by doing `print j.encode('utf-8')`, but that seems extremely awkward...
Fredrik
@jleedev—My console says otherwise. Can you be more specific? What results are you getting? @Fredrik—Sounds like your terminal wants to use Latin-1 encoding. You'll have to find a way to convince it to use UTF-8, or write your output to a file instead of printing (I recommend `import codecs; f = codecs.open('output.txt', 'w', encoding='utf-8')`). Good luck!
jcdyer
@jleedev—Ah. I see what's going on. It depends on your font, to some extent, and there's nothing python can do about that, but it does fix the issue with the character counts in the second `for` loop.
jcdyer
+2  A: 

Using the `unicodedata.east_asian_width` function, keep track of which characters are narrow and which are wide when computing the display width of the string.

#!/usr/bin/python
# coding=utf-8

import sys
import codecs
import unicodedata

out = codecs.getwriter('utf-8')(sys.stdout)

def width(string):
    # East Asian wide ('W') and fullwidth ('F') characters occupy two
    # columns in a terminal; everything else occupies one.
    return sum(1+(unicodedata.east_asian_width(c) in "WF")
        for c in string)

a1=[u'する', u'します', u'trazan', u'した', u'しました']
a2=[u'dipsy', u'laa-laa', u'banarne', u'po', u'tinky winky']

for i,j in zip(a1,a2):
    out.write('%s %s: %s\n' % (i, ' '*(12-width(i)), j))

Outputs:

する          : dipsy
します        : laa-laa
trazan        : banarne
した          : po
しました      : tinky winky

It doesn’t look right in some web browser fonts, but in a terminal window they line up properly.
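On Python 3 (an assumption; the answer above targets Python 2), the same width-aware padding can be sketched without the `codecs` wrapper, since `print` handles Unicode directly. The 12-column width is arbitrary, as in the original:

```python
import unicodedata

def width(s):
    # East Asian wide ('W') and fullwidth ('F') characters count as two
    # columns; everything else as one.
    return sum(2 if unicodedata.east_asian_width(c) in 'WF' else 1
               for c in s)

def pad(s, columns):
    # Left-justify to a given number of terminal columns.
    return s + ' ' * max(0, columns - width(s))

a1 = ['する', 'します', 'trazan', 'した', 'しました']
a2 = ['dipsy', 'laa-laa', 'banarne', 'po', 'tinky winky']

for i, j in zip(a1, a2):
    print(pad(i, 12), ':', j)
```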

jleedev
a tab is not a solution; what I'm really doing is generating Sphinx tables containing Japanese verb conjugations. I'll check the east_asian_width function...
Fredrik
perfect, just what I was looking for, in theory at least. Trying to run it though gives me this:

$ ./try.py
Traceback (most recent call last):
  File "./try.py", line 12, in <module>
    print i,' '*(12-width(i)),':',j
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)
Fredrik
@Fredrik Ouch, you might need to look at `sys.setdefaultencoding`. http://blog.ianbicking.org/illusive-setdefaultencoding.html
jleedev
extremely annoying, can't get it to work...

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'

can you pls post the complete code?
Fredrik
Ok, I think the correct solution is not to use the default encoding, but to explicitly encode every unicode string into the codec you want. See this question (http://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python). OS X appears to have customized this problem away...
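For example (a Python 3 sketch; in Python 2 the equivalent is `u'...'.encode('utf-8')`), explicitly encoding before writing bypasses the default-encoding guesswork entirely:

```python
import sys

# Encode explicitly rather than relying on the interpreter's or
# terminal's default encoding.
s = 'しました'
encoded = s.encode('utf-8')          # bytes, independent of any default
sys.stdout.buffer.write(encoded + b'\n')
```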
jleedev
thanks, it's working for me now; sorry about my confusion ;-)
Fredrik