I have a problem reading and printing a UTF-8 text file in Python.

I have a file, test.txt, encoded as UTF-8 without a BOM. The file contains two characters:

大声

The first character "大" is Chinese and the second "声" is Japanese. I use Ulipad (a Python editor) to run the following code, which reads the text file and prints these two characters:

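import codecs

infile = "C:\\test.txt"

# Decode the file as UTF-8 while reading; the file has no BOM.
with codecs.open(infile, "r", "utf-8") as f:
    s = f.read()

# print() encodes s with the encoding of the output stream.
print(s)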

I get this error:

"UnicodeEncodeError: 'cp950' codec can't encode character '\u58f0' in position 1:
 illegal multibyte sequence"

I found that it is caused by the second character, "声".

But when I run the same code in IDLE, Python's default GUI, it prints both characters with no error. How can I fix the problem?

My environment is Python 3.1 on Windows XP (Traditional Chinese).

+6  A: 

You get the error when you are printing because:

(1) Ulipad is printing to sys.stdout, which is the stdout of the legacy MS-DOS Command Prompt window.
(2) Your Traditional Chinese Windows XP uses the cp950 encoding, which is big5 plus Microsoftian fiddling.
(3) You say your 2nd character is Japanese, by which you probably mean that it's not also Chinese, and thus it is unlikely to be a valid character in big5+.

On the other hand, IDLE is writing to its own window and is not bound to the MS-DOS wheel :-) ... so there's a much greater repertoire of characters that it can print.
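
To see this without involving the console at all, here is a minimal sketch that tries the same encoding step that print() performs. The helper name safe_text is made up for this illustration, and escaping the unencodable character is just one possible workaround, not something Ulipad itself does:

```python
import sys


def safe_text(text, encoding):
    """Return text with characters that `encoding` cannot represent
    replaced by backslash escapes (hypothetical helper)."""
    return text.encode(encoding, "backslashreplace").decode(encoding)


s = "\u5927\u58f0"  # the two characters from test.txt: 大声

# Reproduce the failure directly: per the traceback in the question,
# cp950 has no mapping for U+58F0.
try:
    s.encode("cp950")
    cp950_ok = True
except UnicodeEncodeError:
    cp950_ok = False

# Printing the escaped form should not raise, whatever the console encoding is.
console_encoding = sys.stdout.encoding or "ascii"
print(safe_text(s, console_encoding))
```

On a cp950 console this should show the first character normally and the second one as an escape sequence instead of raising UnicodeEncodeError.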

John Machin
Actually, IDLE is based on Tkinter (basically, the Tk toolkit), which supports the full range of Unicode (and even automatically does font substitution), although I'm not as sure about its bidirectional capabilities…
ΤΖΩΤΖΙΟΥ
A: 

声 may be Japanese, but it is also the Simplified Chinese for "sound" (traditional 聲). cp950 is Traditional Chinese and doesn't support that simplified character.

Since you are using a Chinese version of Windows, you may be able to change your default code page to cp936 (Simplified Chinese, GBK) and see the output.

I'm unfamiliar with Ulipad, but try running in a Windows console:

chcp 936

and then running your script. If that doesn't work, you can change the default language for non-Unicode programs through Control Panel, Regional and Language Options, Advanced tab. This is how I was able to print Chinese in a console on my US English-based Windows.
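
Before switching code pages, you can check from Python which encodings can represent your text. This is only a rough sketch: can_encode is a name I made up, and it tests Python's codecs, which I am assuming match what the console does for the same code page:

```python
def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding
    (hypothetical helper for illustration)."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


s = "\u5927\u58f0"  # 大声

# Report, for each candidate code page, whether it can represent both characters.
for name in ("cp950", "cp936", "utf-8"):
    print(name, can_encode(s, name))
```

If cp936 reports True for your real data, chcp 936 has a chance of working; any text with characters outside cp936 would still fail.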

Update

Reading about Ulipad, it says:

Multilanguage support: Currently supports 4 languages: English, Spanish, Simplified Chinese and Traditional Chinese, which can be auto-detected.

Perhaps you can override the auto-detected Traditional Chinese to Simplified Chinese, which may select a code page and/or font that supports that particular character. Since it doesn't support Japanese, there will probably still be characters you can't display properly.

Mark Tolonen
But changing his codepage will mess up anything else that needs cp950.
John Machin
True, but 'chcp 936' is temporary and local to the command prompt in which it was entered. Changing the non-Unicode default generally only affects older programs. I noticed only a couple of problems switching my US English system to Chinese (PRC). I'm just suggesting options. He specifically wanted to fix Ulipad, which sounds like it requires a code-page change... Hmm, reading about Ulipad, I'll update my answer with another option...
Mark Tolonen