I read Japanese, and want to try processing some Japanese text. I tried this using Python 3:

for i in range(1,65535):
    print(chr(i), end='')

Python then gave me tons of errors. What went wrong?

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~Traceback (most recent call last):
  File "C:\test\char.py", line 11, in <module>
    print(chr(i), end='')
  File "C:\Python31\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x80' in position 0: character maps to <undefined>

My understanding is that the chr function goes on to convert Unicode numbers into the respective Japanese characters. If so, why are the Japanese characters not outputted? Why does it crash at the end of the list of Roman characters?

Please also correct me if I am mistaken in my understanding that the Unicode set was devised solely to cater for non-Western languages.


EDIT:

I tried the 3 lines suggested by John Machin in IDLE, and the output worked!

Before this, I had been using Programmer's Notepad, with the Tools set to capture python.exe compiler's output. Perhaps that is why the errors came about.

However, for most other things, the output is captured properly; then why does it fail particularly in this process? i.e. Why does the code work in the IDLE Python Shell, but not through Programmer's Notepad output capture? Shouldn't the output be the same, regardless of the interface?

A: 

You're attempting to encode a character (\x80) that isn't defined by your codec; there is no correct mapping, so charmap_encode raises an exception. You could wrap the print call in a try: block, then catch and ignore the exception so that only the characters you can encode are printed.
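
A minimal sketch of that idea, assuming the target encoding is cp1252 (the codec named in the traceback). The helper name encodable_chars is made up for illustration; it tests each character against the codec instead of wrapping print itself, which has the same effect of skipping whatever cannot be encoded:

```python
def encodable_chars(code_points, encoding):
    """Return a string of only those characters that `encoding` can represent."""
    kept = []
    for cp in code_points:
        ch = chr(cp)
        try:
            ch.encode(encoding)      # raises UnicodeEncodeError if unmapped
        except UnicodeEncodeError:
            continue                 # skip characters the codec cannot encode
        kept.append(ch)
    return "".join(kept)


# Code points 0x20..0xFF; cp1252 has no mapping for U+0080..U+009F.
text = encodable_chars(range(0x20, 0x100), "cp1252")
print(len(text), "characters can be printed on a cp1252 stdout")
```

Note that this only hides the problem: no Japanese character survives the filter, because cp1252 does not contain any.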

Wooble
I'm sorry, but I have 3 other questions: (1) What is a 'codec' in regard to Python? Is it a module? (2) What is 'correct mapping'? (3) How may I produce a list of Japanese characters using, say, a for loop?
anonnoir
A codec is a "coder/decoder" (look up "modem", it's similar). Actually, it should be "encoder/decoder", but I guess a "modem" analogue sounded nicer (would have turned into "encdec"... and that's ugly for sure). It's a "thing" (not sure if it's a module or just a class, I think the latter) that both encodes into a specific encoding (from Unicode, which isn't an encoding) and decodes into Unicode. See it as a "translator". "correct mapping"... well, it's just that the codec Python uses doesn't know how to encode the \x80. "isn't defined" as Wooble said. And now I'm out of characters (haha)
jae
+2  A: 

Your problem is your default terminal (output) encoding. Probably latin-1 or even the perennial Python default, ASCII. Those can't encode Japanese characters (since it's assumed that the terminal can't display them).

If your terminal does UTF-8 (the most often used Unicode encoding in the western world), you can either "trick" Python into taking this as the default output encoding, or you can explicitly encode the Unicode string to UTF-8 with

>>> print(chr(i).encode("UTF-8"), end='')
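
In Python 3, passing the result of .encode() to print shows the repr of a bytes object (b'...') rather than the characters. A hedged sketch of one way around that is to write the encoded bytes to a binary stream. The helper name write_utf8 is made up for illustration, and io.BytesIO stands in for a real binary stream here; on a console, sys.stdout.buffer would be the usual candidate, though not every shell provides it:

```python
import io


def write_utf8(code_points, binary_stream):
    """Encode each code point as UTF-8 and write the raw bytes to the stream."""
    for cp in code_points:
        binary_stream.write(chr(cp).encode("utf-8"))


# io.BytesIO is a stand-in; on a UTF-8 terminal one might pass sys.stdout.buffer.
buf = io.BytesIO()
write_utf8(range(0x3042, 0x3045), buf)   # three hiragana code points
raw = buf.getvalue()
print(raw)   # the bytes repr, as in the question's follow-up comment
```

Whether the bytes then display as Japanese depends entirely on the terminal interpreting them as UTF-8.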

And as to the "solely", I think that's wrong. It was created to be the one encoding to bind them... ehm, sorry, the one and only encoding we'll ever need. The encoding (okay, that's using "encoding" not in the sense it's used in the Unicode definition) that can be used to encode all text documents.

jae
How do I trick Python into changing the terminal output encoding? I am using IDLE. Also, your code worked! Thank you! I tried using this loop: "for i in range(1, 10): print(chr(i).encode("UTF-8"), end='')" and Python gave me this output: "b'\x01'b'\x02'b'\x03'b'\x04'b'\x05'b'\x06'b'\x07'b'\x08'b'\t'" But how may I convert the current output into normal characters?
anonnoir
Also, why does "print(ord(b'\x01'))" successfully print out 1, but "print(ord(u'\ua000'))" produces an error, "SyntaxError: invalid syntax"?
anonnoir
A: 

No need to try all 65536 codes of the BMP. Just use the code blocks used for Japanese text.
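
As a rough sketch, the three Unicode blocks most relevant to Japanese can be listed with their ranges, and unassigned code points skipped by checking for a character name. The names JAPANESE_BLOCKS and assigned_chars are invented for this example:

```python
import unicodedata

# Inclusive block ranges from the Unicode standard.
JAPANESE_BLOCKS = {
    "Hiragana": (0x3040, 0x309F),
    "Katakana": (0x30A0, 0x30FF),
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
}


def assigned_chars(first, last):
    """Return the characters in [first, last] that have a Unicode name."""
    return [
        chr(cp)
        for cp in range(first, last + 1)
        if unicodedata.name(chr(cp), "")   # "" for unassigned code points
    ]


hiragana = assigned_chars(*JAPANESE_BLOCKS["Hiragana"])
print(len(hiragana), "assigned code points in the Hiragana block")
```

Printing the characters themselves still requires an output encoding that can represent them.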

devio
A: 

Assuming you are using Python 2.x, here is the documentation for the built-in codecs module. Python 3.x uses Unicode internally, so you don't have to think as much about non-Latin character sets.

msw
It's snark-time: note the "C:\Python31\" bit in the second code part. ;-)
jae
snark deserved :(
msw
@jae: Your snark is a boojum -- note the "I tried this using Python 3" in the first line.
John Machin
And I obviously got snarked indirectly (but very deservedly) by you, John, in the "remember the mention of cp1252" in your answer. Or was that unintentional? :-)
jae
@jae: No, you were not even a pixel on my radar screen when I wrote that. In any case, standard Bayesian inference should lead you to the conclusion that even without confirming evidence, an unknown encoding is highly likely to be cp1252 :-)
John Machin
+3  A: 

If as you say you read Japanese, you must be aware that Japanese is written using FOUR different types of characters: (1) kanji (Chinese characters) (2) Katakana (3) Hiragana (4) Romaji ("Roman" letters). There are many tens of thousands of kanji of which only a few thousand are in common use.

Your code, had it worked as you imagined it might, would have printed not only the "Roman" characters, but also Greek, Arabic, Hebrew, Cyrillic (used in Russian etc), Armenian, half a dozen or so different but related character sets used in India, many I've left out, about 11 thousand Hangul Syllables (used in Korean) and a bunch of gibberish for code points that aren't used, and (depending on which shell you were running it in) may have crashed when it got to 0xD800 (the first surrogate).
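
To illustrate the surrogate point: a lone surrogate code point cannot be encoded even as UTF-8, so a loop over the whole BMP fails there on any output that encodes strictly. A small sketch (the helper name utf8_encodable is made up for this example):

```python
def utf8_encodable(cp):
    """Return True if chr(cp) can be encoded as UTF-8 with strict error handling."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Collect every code point in the original loop's range that UTF-8 rejects.
rejected = [cp for cp in range(1, 65535) if not utf8_encodable(cp)]
print(hex(rejected[0]), hex(rejected[-1]), len(rejected))
```

The rejected code points are exactly the surrogate range, 0xD800 through 0xDFFF.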

A little less ambition will give you Hiragana, Katakana, and a few "CJK Unified Ideographs". The examples below were run in IDLE.

>>> for i in range(0x3040, 0x30a0): print(chr(i), end='')

぀ぁあぃいぅうぇえぉおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろゎわゐゑをんゔゕゖ゗゘゙゚゛゜ゝゞゟ
>>> for i in range(0x30a0, 0x3100): print(chr(i), end='')

゠ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ・ーヽヾヿ
>>> for i in range(0x4e00, 0x4f00): print(chr(i), end='')

一丁丂七丄丅丆万丈三上下丌不与丏丐丑丒专且丕世丗丘丙业丛东丝丞丟丠両丢丣两严並丧丨丩个丫丬中丮丯丰丱串丳临丵丶丷丸丹为主丼丽举丿乀乁乂乃乄久乆乇么义乊之乌乍乎乏乐乑乒乓乔乕乖乗乘乙乚乛乜九乞也习乡乢乣乤乥书乧乨乩乪乫乬乭乮乯买乱乲乳乴乵乶乷乸乹乺乻乼乽乾乿亀亁亂亃亄亅了亇予争亊事二亍于亏亐云互亓五井亖亗亘亙亚些亜亝亞亟亠亡亢亣交亥亦产亨亩亪享京亭亮亯亰亱亲亳亴亵亶亷亸亹人亻亼亽亾亿什仁仂仃仄仅仆仇仈仉今介仌仍从仏仐仑仒仓仔仕他仗付仙仚仛仜仝仞仟仠仡仢代令以仦仧仨仩仪仫们仭仮仯仰仱仲仳仴仵件价仸仹仺任仼份仾仿

Update: The reason you had a problem is that the shell/IDE that you were using supplies only the Windows GUI bog-standard stdout, for which the default encoding (in your neck of the woods) is cp1252 (remember the mention of cp1252 in your traceback?), which is adequate in your case for the Romaji but not much else. Available-anywhere-without-downloads alternatives: (1) IDLE (2) write a file encoded in UTF-8 and read it in Notepad. I'm sure others could suggest other IDEs.
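
A minimal sketch of alternative (2), assuming a writable temporary directory. The file name hiragana.txt and the helper dump_range are placeholders chosen for this example:

```python
import os
import tempfile


def dump_range(path, start, stop):
    """Write the characters for code points start..stop-1 to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("".join(chr(i) for i in range(start, stop)))


# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.gettempdir(), "hiragana.txt")
dump_range(path, 0x3041, 0x3097)   # the hiragana letters U+3041..U+3096

with open(path, encoding="utf-8") as f:
    print(len(f.read()), "characters written to", path)
```

Because the file is opened with an explicit encoding, the stdout encoding of the shell never comes into play. If the editor fails to detect UTF-8, the "utf-8-sig" codec, which adds a byte order mark, may help.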

John Machin
Yes, I am aware of the Japanese syllabaries, the extent of use of Chinese characters in Japanese, and that my code would have printed out other characters not related to the former two. I tried all 3 codes, but I keep getting this error: return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u3040' in position 0: character maps to <undefined>
anonnoir
@user283169: don't put that detail in comments; edit your question to show the full traceback and error message, and tell us what shell (e.g. IDLE) you are using. Note that I ran what I showed you in Python 3.1's IDLE (using the interactive prompt) under Windows XP SP2. You appear to be running under Windows also; if you can't reproduce my results, you'd need to say rather exactly what you are doing (including the contents of "char.py").
John Machin
I'm sorry about that. I will take the advice and do that next time around. (A quickie: if I update the original question, will the post be 'bumped up'?) An interesting development came up, and I have updated my question to reflect it.
anonnoir
I'm sorry, but I don't quite understand your answer. I've tried changing the code page/character set to UTF-8 and Shift-JIS in Programmer's Notepad, but this doesn't seem to help. I don't know if that's the right way to do it, but I guess I'll have to try using IDLE or some other IDE for now.
anonnoir
Code page / charset that you are talking about is highly likely to be the charset to use for your **source code** (.py) files. What you need to look for is a way of changing the charset that it uses for **stdout** when you use it to run a script. Currently it's using cp1252 (as witnessed by the traceback that you posted; go back and re-read it to see what I mean). Failing that, change IDEs.
John Machin