I am trying to parse a CSV file containing mostly numeric data along with some strings. I do not know the encoding of the strings, but I do know they are in Hebrew.

Eventually I need to know the encoding so I can convert the strings to unicode, print them, and perhaps store them in a database later on.

I tried chardet, which reports the strings as Windows-1255 (cp1255), but print someString.decode('cp1255') raises the notorious error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4: ordinal not in range(128)

I tried every other encoding possible, to no avail. Also, the file is absolutely valid since I can open the CSV in Excel and I see the correct data.

Any idea how I can properly decode these strings?


EDIT: here is an example. One of the strings looks like this (first five letters of the Hebrew alphabet):

print repr(sampleString)
#prints:
'\xe0\xe1\xe2\xe3\xe4'

(using Python 2.6.2)

A: 

Is someString maybe not a normal byte string but a unicode string, unlike the sampleString you showed us?

>>> print '\xe0\xe1\xe2\xe3\xe4'.decode('cp1255')
<hebrew characters>

>>> print u'\xe0\xe1\xe2\xe3\xe4'.decode('cp1255')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "[...]/encodings/cp1255.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode characters [...]
hop
uhm... why the downvote?
hop
Because someString is obviously not a unicode string. If it was, print repr(someString) would display u'...'.
codeape
@codeape: yeah, but he never actually showed us the result of repr(someString), did he? he only showed us repr(sampleString), which could be a different thing entirely.
hop
I assumed sampleString was of the same kind as someString.
codeape
A: 

You're getting an encode error when printing, so most likely it's decoding fine, you just can't print out the result properly. Try running chcp 65001 at the command prompt before starting the Python code.

Ignacio Vazquez-Abrams
Running `chcp 65001` won't help with this because the error is that Python itself is trying to implicitly encode the unicode string for `print` using its default encoding, ASCII.
Mike Graham
+3  A: 

When you decode the string to unicode with someString.decode('cp1255'), you have an abstract representation of some Hebrew text in unicode. (This part happens successfully!) When you use print, you need a concrete, encoded representation in a specific encoding. It looks like your problem isn't with the decode, but with the print.

To print, either use print someString directly if your terminal understands cp1255, or use print someString.decode('cp1255').encode('the_encoding_your_terminal_does_understand'). If you don't need the printed result to be readable as Hebrew, print repr(someString.decode('cp1255')) also gives you a meaningful representation of the abstract unicode string.
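The decode-then-encode round trip described above can be sketched as follows. This is only an illustration: it uses Python 3 syntax (an assumption, since the thread is about Python 2.6), and encode_for is a hypothetical helper name, not part of any library.

```python
# Bytes from the question; cp1255 is the encoding chardet reported.
raw = b'\xe0\xe1\xe2\xe3\xe4'

# Step 1: decode the bytes into abstract unicode text.
text = raw.decode('cp1255')

def encode_for(text, encoding):
    """Hypothetical helper: return text encoded in `encoding`,
    or None if that encoding cannot represent it."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# Step 2: encode again for whatever the output target understands.
utf8_bytes = encode_for(text, 'utf-8')    # works: UTF-8 covers Hebrew
ascii_bytes = encode_for(text, 'ascii')   # None: ASCII has no Hebrew letters

print(repr(utf8_bytes))
print(repr(ascii_bytes))
```

The decode step succeeds in both cases; only the choice of target encoding decides whether the second step works.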

Mike Graham
+5  A: 

This is what's happening:

  • sampleString is a byte string (cp1255 encoded)
  • sampleString.decode("cp1255") decodes (decode==bytes -> unicode string) the byte string to a unicode string
  • print sampleString.decode("cp1255") attempts to print the unicode string to stdout. To do that, print has to encode the unicode string (encode==unicode string -> bytes). The error that you're seeing means that the Python print statement cannot encode the given unicode string in the encoding it uses for output. sys.stdout.encoding is the terminal's encoding.

So the problem is that your console does not support these characters. You should be able to tweak the console to use another encoding. The details of how to do that depend on your OS and terminal program.
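One way to check this before printing is to test whether the text can be encoded in the stream's reported encoding. The sketch below assumes Python 3 syntax (the thread uses Python 2.6), and can_print is a hypothetical helper; the fallback to ASCII when the stream reports no encoding is also an assumption, chosen to mirror the error in the question.

```python
import sys

def can_print(text, stream=sys.stdout):
    """Hypothetical helper: True if `text` can be encoded in the
    encoding that `stream` reports."""
    # Assumption: fall back to ASCII when no encoding is reported.
    encoding = getattr(stream, 'encoding', None) or 'ascii'
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

text = b'\xe0\xe1\xe2\xe3\xe4'.decode('cp1255')

print(sys.stdout.encoding)   # what the console claims to understand
print(can_print(text))       # whether the Hebrew text fits that encoding
```

If can_print returns False, the console encoding is the thing to change, not the decode call.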

Another approach would be to manually specify the encoding to use:

print sampleString.decode("cp1255").encode("utf-8")

A simple test program you can experiment with:

import sys
print sys.stdout.encoding
samplestring = '\xe0\xe1\xe2\xe3\xe4'
print samplestring.decode("cp1255").encode(sys.argv[1])

On my utf-8 terminal:

$ python2.6 test.py utf-8
UTF-8
אבגדה

$ python2.6 test.py latin1
UTF-8
Traceback (most recent call last):
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-4: ordinal not in range(256)

$ python2.6 test.py ascii
UTF-8
Traceback (most recent call last):
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)

$ python2.6 test.py cp424
UTF-8
ABCDE

$ python2.6 test.py iso8859_8
UTF-8
�����

The error messages for latin-1 and ascii mean that the unicode characters in the string cannot be represented in these encodings.

Notice the last two. I encode the unicode string to the cp424 and iso8859_8 encodings (two of the encodings listed on http://docs.python.org/library/codecs.html#standard-encodings that support Hebrew characters). I get no exception using these encodings, since the Hebrew unicode characters have a representation in them.

But my utf-8 terminal gets very confused when it receives bytes in a different encoding than utf-8.

In the first case (cp424), my UTF-8 terminal displays ABCDE because cp424 encodes the five Hebrew letters as the byte values 65 to 69, which UTF-8 reads as A to E. For example, the byte value 65 means A in UTF-8 and א in cp424.
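To see why the terminal gets confused, it can help to look at the raw bytes each encoding produces for the same text. This is a sketch in Python 3 syntax (an assumption, as the thread uses Python 2.6); the variable names are only illustrative.

```python
text = b'\xe0\xe1\xe2\xe3\xe4'.decode('cp1255')

# Same unicode text, different concrete byte sequences.
for name in ('cp1255', 'utf-8', 'cp424', 'iso8859_8'):
    encoded = text.encode(name)
    print(name, [hex(b) for b in bytearray(encoded)])

# cp424 yields bytes that happen to be plain ASCII letters,
# so a UTF-8 terminal shows them as Latin text.
cp424_bytes = text.encode('cp424')
shown_by_utf8_terminal = cp424_bytes.decode('utf-8')

# iso8859_8 yields bytes that are not valid UTF-8, so a UTF-8
# terminal can only show replacement characters for them.
iso_bytes = text.encode('iso8859_8')
iso_as_utf8 = iso_bytes.decode('utf-8', 'replace')
```

Here shown_by_utf8_terminal matches the ABCDE output above, and iso_as_utf8 contains U+FFFD replacement characters like the garbled last line.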

The encode method has an optional string argument you can use to specify what should happen when the encoding cannot represent a character (documentation). The supported strategies are strict (the default), ignore, replace, xmlcharrefreplace and backslashreplace. You can even add your own custom strategies.
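As a rough sketch of a custom strategy, the standard codecs.register_error function lets you register a handler under a name of your choosing. The handler name 'placeholder' and the function placeholder_handler below are made up for illustration, and the code uses Python 3 syntax (an assumption, since the thread is about Python 2.6).

```python
import codecs

def placeholder_handler(error):
    """Hypothetical strategy: replace each unencodable character with '[?]'."""
    if isinstance(error, UnicodeEncodeError):
        bad = error.object[error.start:error.end]
        # Return the replacement text and the position to resume from.
        return (u'[?]' * len(bad), error.end)
    raise error

# Register under a made-up name so encode() can refer to it.
codecs.register_error('placeholder', placeholder_handler)

text = b'\xe0\xe1\xe2\xe3\xe4'.decode('cp1255')
result = text.encode('latin1', 'placeholder')
print(repr(result))
```

The replacement text must itself be representable in the target encoding, which is why the sketch sticks to ASCII characters.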

Another test program (I print with quotes around the string to better show how ignore behaves):

import sys
samplestring = '\xe0\xe1\xe2\xe3\xe4'
print "'{0}'".format(samplestring.decode("cp1255").encode(sys.argv[1], 
      sys.argv[2]))

The results:

$ python2.6 test.py latin1 strict
Traceback (most recent call last):
  File "test.py", line 4, in <module>
    sys.argv[2]))
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-4: ordinal not in range(256)
[/tmp]
$ python2.6 test.py latin1 ignore
''
[/tmp]
$ python2.6 test.py latin1 replace
'?????'
[/tmp]
$ python2.6 test.py latin1 xmlcharrefreplace
'&#1488;&#1489;&#1490;&#1491;&#1492;'
[/tmp]
$ python2.6 test.py latin1 backslashreplace
'\u05d0\u05d1\u05d2\u05d3\u05d4'
codeape
Superb answer! Explains everything one needs to know about Python string encodings... Thanks a bunch
Yuval A