views:

170

answers:

2

I'm a beginner having trouble decoding several dozen CSV files with numbers and (Simplified) Chinese characters to UTF-8 in Python 2.7.

I do not know the encoding of the input files so I have tried all the possible encodings I am aware of -- GB18030, UTF-7, UTF-8, UTF-16 & UTF-32 (LE & BE). Also, for good measure, GBK and GB2312, though these should be a subset of GB18030. The UTF ones all stop when they get to the first Chinese characters. The other encodings stop somewhere in the first line, except GB18030. I thought this would be the solution because it read through the first few files and decoded them fine. Part of my code, reading line by line, is: line = line.decode("GB18030").

The first 2 files I tried to decode worked fine. Midway through the third file, Python spits out

UnicodeDecodeError: 'gb18030' codec can't decode bytes in position 168-169: illegal multibyte sequence

In this file, there are about 5 such errors in about a million lines.

I opened the input file in a text editor and checked which characters were giving the decoding errors, and the first few all had Euro signs in a particular column of the CSV files. I am fairly confident these are typos, so I would just like to delete the Euro characters. I would like to examine types of encoding errors one by one; I would like to get rid of all the Euro errors but do not want to just ignore others until I look at them first.

Edit: I used chardet which gave GB2312 as the encoding with .99 confidence for all files. I tried using GB2312 to decode which gave: UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 108-109: illegal multibyte sequence.

A: 

You might try chardet.

Mark Tolonen
+1  A: 

""" ... GB18030. I thought this would be the solution because it read through the first few files and decoded them fine.""" -- please explain what you mean. To me, there are TWO criteria for a successful decoding: firstly that raw_bytes.decode('some_encoding') didn't fail, secondly that the resultant unicode, when displayed, makes sense in a particular language. Every file in the universe will pass the first test when decoded with latin1 aka iso_8859_1. Many files in East Asian languages pass the first test with gb18030, because the frequently used characters in Chinese, Japanese, and Korean are mostly encoded using the same blocks of two-byte sequences. How much of the second test have you done?

Don't muck about looking at the data in an IDE or text editor. Look at it in a web browser; browsers usually do a better job of detecting encodings.

How do you know that it's a Euro character? By looking at the screen of a text editor that's decoding the raw bytes using what encoding? cp1252?

How do you know it contains Chinese characters? Are you sure it's not Japanese? Korean? Where did you get it from?

Chinese files created in Hong Kong, Taiwan, maybe Macao, and other places off the mainland use big5 or big5_hkscs encoding -- try that.

In any case, take Mark's advice and point chardet at it; chardet usually does a reasonably good job of detecting the encoding, provided the file is large enough and consists of correctly encoded Chinese/Japanese/Korean -- however, if someone has been hand-editing the file in a text editor using a single-byte charset, a few illegal characters may prevent the encoding used for the other 99.9% of the characters from being detected.

You may like to do print repr(line) on say 5 lines from the file and edit the output into your question.

If the file is not confidential, you may like to make it available for download.

Was the file created on Windows? How are you reading it in Python? (show code)

Update after OP comments:

Notepad etc don't attempt to guess the encoding; "ANSI" is the default. You have to tell it what to do. What you are calling the Euro character is the raw byte "\x80" decoded by your editor using the default encoding for your environment -- the usual suspect being "cp1252". Don't use such an editor to edit your file.

Earlier you were talking about the "first few errors". Now you say you have 5 errors total. Please explain.

If the file is indeed almost correct gb18030, you should be able to decode the file line by line, and when you get such an error, trap it, print the error message, extract the byte offsets from the message, print repr(two_bad_bytes), and keep going. I'm very interested in which of the two bytes the \x80 appears. If it doesn't appear at all, the "Euro character" is not part of your problem. Note that \x80 can appear validly in a gb18030 file, but only as the 2nd byte of a 2-byte sequence starting with \x81 to \xfe.
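The trap-and-continue loop described above can be sketched as a small helper (the function name and the sample bytes below are mine, not from the question; the byte literals run under both Python 2.6+ and Python 3):

```python
def survey_errors(raw_lines, encoding='gb18030'):
    """Try to decode each line; on failure, report the offending
    bytes and their offsets, then keep going with the next line."""
    bad = []
    for lineno, raw in enumerate(raw_lines, 1):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError as e:
            # e.start/e.end are the byte offsets from the error message
            bad_bytes = raw[e.start:e.end]
            print('line %d, bytes %d-%d: %r'
                  % (lineno, e.start, e.end - 1, bad_bytes))
            bad.append((lineno, bad_bytes))
    return bad
```

If \x80 never shows up among the reported bytes, the "Euro character" is not part of your problem.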

It's a good idea to know what your problem is before you try to fix it. Trying to fix it by bashing it about with Notepad etc in "ANSI" mode is not a good idea.

You have been very coy about how you decided that the results of gb18030 decoding made sense. In particular I would be closely scrutinising the lines where gbk fails but gb18030 "works" -- there must be some extremely rare Chinese characters in there, or maybe some non-Chinese non-ASCII characters ...

Here's a suggestion for a better way to inspect the damage: decode each file with raw_bytes.decode(encoding, 'replace') and write the result (encoded in utf8) to another file. Count the errors by result.count(u'\ufffd'). View the output file with whatever you used to decide that the gb18030 decoding made sense. The U+FFFD character should show up as a white question mark inside a black diamond.

If you decide that the undecodable pieces can be discarded, the easiest way is raw_bytes.decode(encoding, 'ignore').

Update after further information

All those \\ are confusing. It appears that "getting the bytes" involves repr(repr(bytes)) instead of just repr(bytes) ... at the interactive prompt, do either bytes (you'll get an implicit repr()), or print repr(bytes) (which won't get the implicit repr()).

The blank space: I presume that you mean that '\xf8\xf8'.decode('gb18030') is what you interpret as some kind of full-width space, and that the interpretation is done by visual inspection using some unnameable viewer software. Is that correct?

Actually, '\xf8\xf8'.decode('gb18030') -> u'\ue28b'. U+E28B is in the Unicode PUA (Private Use Area). The "blank space" presumably means that the viewer software unsurprisingly doesn't have a glyph for U+E28B in the font it is using.

Perhaps the source of the files is deliberately using the PUA for characters that are not in standard gb18030, or for annotation, or for transmitting pseudosecret info. If so, you will need to resort to the decoding tambourine, an offshoot of recent Russian research reported here.

Alternative: the cp939-HKSCS theory. According to the HK government, HKSCS big5 code FE57 was once mapped to U+E28B but is now mapped to U+28804.

The "euro": You said """Due to the data I can't share the whole line, but what I was calling the euro char is in: \xcb\xbe\x80\x80" [I'm assuming a \ was omitted from the start of that, and the " is literal]. The "euro character", when it appears, is always in the same column that I don't need, so I was hoping to just use "ignore". Unfortunately, since the "euro char" is right next to quotes in the file, sometimes "ignore" gets rid of both the euro character as well [as] quotes, which poses a problem for the csv module to determine columns"""

It would help enormously if you could show the patterns of where these \x80 bytes appear in relation to the quotes and the Chinese characters -- keep it readable by just showing the hex, and hide your confidential data e.g. by using C1 C2 to represent "two bytes which I am sure represent a Chinese character". For example:

C1 C2 C1 C2 cb be 80 80 22 # `\x22` is the quote character

Please supply examples of (1) where the " is not lost by 'replace' or 'ignore' (2) where the quote is lost. In your sole example to date, the " is not lost:

>>> '\xcb\xbe\x80\x80\x22'.decode('gb18030', 'ignore')
u'\u53f8"'

And the offer to send you some debugging code (see example output below) is still open.

>>> import decode_debug as de
>>> def logger(s):
...    sys.stderr.write('*** ' + s + '\n')
...
>>> import sys
>>> de.decode_debug('\xcb\xbe\x80\x80\x22', 'gb18030', 'replace', logger)
*** input[2:5] ('\x80\x80"') doesn't start with a plausible code sequence
*** input[3:5] ('\x80"') doesn't start with a plausible code sequence
u'\u53f8\ufffd\ufffd"'
>>> de.decode_debug('\xcb\xbe\x80\x80\x22', 'gb18030', 'ignore', logger)
*** input[2:5] ('\x80\x80"') doesn't start with a plausible code sequence
*** input[3:5] ('\x80"') doesn't start with a plausible code sequence
u'\u53f8"'
>>>

Eureka: -- Probable cause of sometimes losing the quote character --

It appears there is a bug in the gb18030 decoder's replace/ignore mechanism: \x80 is not a valid gb18030 lead byte; when it is detected, the decoder should attempt to resync with the NEXT byte. However, it seems to be ignoring both the \x80 AND the following byte:

>>> '\x80abcd'.decode('gb18030', 'replace')
u'\ufffdbcd' # the 'a' is lost
>>> de.decode_debug('\x80abcd', 'gb18030', 'replace', logger)
*** input[0:4] ('\x80abc') doesn't start with a plausible code sequence
u'\ufffdabcd'
>>> '\x80\x80abcd'.decode('gb18030', 'replace')
u'\ufffdabcd' # the second '\x80' is lost
>>> de.decode_debug('\x80\x80abcd', 'gb18030', 'replace', logger)
*** input[0:4] ('\x80\x80ab') doesn't start with a plausible code sequence
*** input[1:5] ('\x80abc') doesn't start with a plausible code sequence
u'\ufffd\ufffdabcd'
>>>
John Machin
Thanks for the suggestions. 1.) It was successfully decoded in both senses. For the first two files at least. On the third file it was successful in the sense that what got decoded was correct. But for the lines with euro signs, it threw an error. 2.) Firefox is not recognizing the encoding correctly. 3.) I am not sure what encoding the text editor is using to read it -- in notepad/Notepad++ it just says ANSI; from what I've read this seems odd/incorrect. 4.) The text is from mainland China, contains no documentation and is indeed confidential.
rallen
Notepad *does* guess the encoding of your file, but only from a very limited subset (system character set, or ANSI, UTF-8, UTF-16LE, UTF-16BE). That's the source of the semi-famous ["Bush hid the facts"](http://en.wikipedia.org/wiki/Bush_hid_the_facts) bug.
Michael Madsen
I didn't realize GBK and GB18030 were so close, so I had given up on GBK once I got my first error early on in decoding the file. What GB18030 was decoding that GBK was not was a one character wide blank space. Unfortunately, my original Access files contain this same space when opened. I think I will just go with the "ignore" errors option, since I found that there were relatively few such errors in my files when I used GB18030, and all the ones I checked ended up being in columns of my data that don't matter. Thanks again.
rallen
@rallen: Thanks for the update. (1) Which "one character wide blank space" was that? Can't be U+3000 IDEOGRAPHIC SPACE, which is the first two-byte character (`'\xA1\xA1'`) in all `GB*` encodings. What is the GB18030 code for this space? (2) May be a real EURO problem after all: the Python codec for GBK aka CP936 has a bug; it hasn't been updated to include the mapping 0x80 -> U+20AC EURO SIGN (10 years after that mapping was added)
John Machin
@rallen: (3) Do the space and the euro account for all your known problems? If not, there's the possibility that you have data created in Guangdong or HK on a box running Chinese Windows (PRC locale, so gbk/cp936) with the MS bolt-on gadget to support HKSCS ... there's no codec for this. I have some scripts that make it easier to examine files that "almost" decode; if you need/want any further help, e-mail me (googling "john machin xlrd" should turn up the address).
John Machin
@Michael Madsen: Yes, I should have said "doesn't guess very well". BTW, ANSI means "system character set" on Windows, and is thus a movable feast; cp1252 is the usual suspect but it could be e.g. cp949 (Korean).
John Machin
@John: Due to the data I can't share the whole line, but what I was calling the euro char is in: \xcb\\xbe\\x80\\x80" . The "euro character", when it appears, is always in the same column that I don't need, so I was hoping to just use "ignore". Unfortunately, since the "euro char" is right next to quotes in the file, sometimes "ignore" gets rid of both the euro character as well as the quotes, which poses a problem for the csv module to determine columns. And the "blank space" is \\xf8\\xf8 . Sorry for the delay in getting back to you -- I just figured out how to get just the bytes.
rallen
@John: '\xf8\xf8'.decode('gb18030') was indeed working for me. You had asked what the characters were that gb18030 was decoding that gbk was not, and IIRC \xf8\xf8 was the only such example. I can't check examples until tomorrow, but I believe the " was lost when there was only one such \x80 -- e.g. C1 C2 80 22 -> C1 C2 when I used "ignore", which is in line with your "Eureka".
rallen
@rallen: There are TWO major differences between "as the 2nd byte of a 2-byte sequence starting with \x81 to \xfe" and "preceded by \x81 or \xfe": (1) "\x81 or \xfe" omits all bytes in the range \x82 to \xfd inclusive (2) you can only know whether you are in a valid-two byte sequence by starting from a known/assumed valid character boundary. Bytes in the range \x81-\xfe are valid lead bytes AND trail bytes in a two-byte sequence. Try decoding \xcb\xbe\x80\x22 (\xbe is a trail byte, \x80 is illegal) and \xbe\x80\x22 (\xbe is a lead byte, \x80 is a valid trail byte).
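Those two experiments, sketched concretely (byte literals run under both Python 2.6+ and 3; the comments state the expected behaviour per the analysis above, not results verified against the original files):

```python
# Here \xbe is the trail byte of the pair \xcb\xbe, so the following
# \x80 must start a new character -- and \x80 is an illegal lead byte.
try:
    b'\xcb\xbe\x80"'.decode('gb18030')
except UnicodeDecodeError as e:
    print('failed at bytes %d-%d' % (e.start, e.end - 1))

# Here \xbe is a lead byte and \x80 is a valid trail byte, so the
# pair decodes as one character, followed by the quote.
u = b'\xbe\x80"'.decode('gb18030')
print('%d chars, ends with %r' % (len(u), u[-1]))
```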
John Machin