(The following uses Python 2.6.1.)
I have two strings:
>>> a = u'\u05e8\u05db\u05e1'
>>> b = u'\u05e8\u05db\u05e1 \u05d4\u05d9\u05d0 \u05de\u05d0\u05d9\u05e8\u05d4 \u05d1\u05e4\u05e0\u05e1'
I encode them:
>>> ua = a.encode('utf-8')
>>> ub = b.encode('utf-8')
>>> ua
'\xd7\xa8\xd7\x9b\xd7\xa1'
>>> ub
'\xd7\xa8\xd7\x9b\xd7\xa1 \xd7\x94\xd7\x99\xd7\x90 \xd7\x9e\xd7\x90\xd7\x99\xd7\xa8\xd7\x94 \xd7\x91\xd7\xa4\xd7\xa0\xd7\xa1'
and try to print them:
>>> print ua
רכס
>>> print ub
רכס היא מאירה בפנס
Why does ub print in Hebrew characters while ua doesn't? ua is just the first few characters of ub, so it seems as though string length is somehow the problem, which is weird.
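To rule out a difference in the data itself, here is a minimal sketch that checks whether the encoded bytes of ua really are a prefix of ub. The names prefix_matches and same_bytes are introduced only for this check, and the snippet is written on the assumption that it should run unchanged on Python 2.6+ and 3.3+.

```python
# -*- coding: utf-8 -*-
# Same two strings as in the session above.
a = u'\u05e8\u05db\u05e1'
b = u'\u05e8\u05db\u05e1 \u05d4\u05d9\u05d0 \u05de\u05d0\u05d9\u05e8\u05d4 \u05d1\u05e4\u05e0\u05e1'

ua = a.encode('utf-8')
ub = b.encode('utf-8')

# Each of these Hebrew letters takes two bytes in UTF-8,
# so the three letters of a become six bytes.
prefix_matches = ub.startswith(ua)   # hypothetical helper name
same_bytes = ub[:len(ua)] == ua      # the same check, done by slicing

print(prefix_matches)  # True
print(len(ua))         # 6
```

Since the bytes match, the difference presumably lies in how the output is displayed rather than in the strings themselves.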
(For the record, this came up trying to parse a webpage with BeautifulSoup -- I couldn't tell why some paragraphs came out readably while others didn't.)