views:

295

answers:

4

I have an Excel spreadsheet that I'm reading in that contains some £ signs.

When I try to read it in using the xlrd module, I get the following error:

x = table.cell_value(row, col)
x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)

If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.

How can I fix this, and read the £ signs in correctly?

--- UPDATE ---

Some kind readers have suggested that I don't need to decode it at all, or that I can just encode it to Latin-1 when I need to. The problem with this is that I need to write the data to a CSV file eventually, and it seems to object to the raw strings.

If I don't encode or decode the data at all, then this happens (after I've added the string to an array called items):

for item in items:
    #item = [x.encode('latin-1') for x in item]
    cleancsv.writerow(item)
File "clean_up_barnet.py", line 104, in <module>
 cleancsv.writerow(item)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 43: ordinal not in range(128)

I get the same error even if I uncomment the Latin-1 line.

+3  A: 

Your code snippet says x.decode, but you're getting an encode error -- meaning x is Unicode already, so, to "decode" it, it must be first turned into a string of bytes (and that's where the default codec ansi comes up and fails). In your text then you say "if I rewrite ot to x.encode"... which seems to imply that you do know x is Unicode.

So what it IS you're doing -- and what it is you mean to be doing -- encoding a unicode x to get a coded string of bytes, or decoding a string of bytes into a unicode object?

I find it unfortunate that you can call encode on a byte string, and decode on a unicode object, because I find it seems to lead users to nothing but confusion... but at least in this case you seem to manage to propagate the confusion (at least to me;-).

If, as it seems, x is unicode, then you never want to "decode" it -- you may want to encode it to get a byte string with a certain codec, e.g. latin-1, if that's what you need for some kind of I/O purposes (for your own internal program use I recommend sticking with unicode all the time -- only encode/decode if and when you absolutely need, or receive, coded byte strings for input / output purposes).

Alex Martelli
Thanks. I guess it's unicode then - but I need to encode it to write it out to a CSV file. When I try to encode it into latin-1, I get another error. Please see my update above...
AP257
Figured it out - the line I needed before writing to CSV was item = [x.encode('mac_roman') for x in item]. Accepting this answer because it helped me work out what was happening - thanks.
AP257
+1  A: 

xlrd works with Unicode, so the string you get back is a Unicode string. The £-sign has code point U+00A3, so the representation of said string should be u'\xa3'. This has been read in correctly; it is the string that you should be working with throughout your program.

When you write this (abstract, Unicode) string somewhere, you need to choose an encoding. At that point, you should .encode it into that encoding, say latin-1.


>>> book = xlrd.open_workbook( "test.xls" )
>>> sh = book.sheet_by_index( 0 )
>>> x = sh.cell_value( 0, 0 )
>>> x
u'\xa3'
>>> print x
£

# sample outputs (for e.g. writing to a file)
>>> x.encode( "latin-1" )
'\xa3'
>>> x.encode( "utf-8" )
'\xc2\xa3'

# garbage, because x is already Unicode
>>> x.decode( "ascii" )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0:
ordinal not in range(128)
>>>
katrielalex
please see my update above: I get an error when I try to .encode it...
AP257
+1  A: 

For what it's worth: I'm the author of xlrd.

Does xlrd produce unicode?
Option 1: Read the Unicode section at the bottom of the first screenful of xlrd doc: This module presents all text strings as Python unicode objects.
Option 2: print type(text), repr(text)

You say """If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.""" Of course if you write UTF-8-encoded text to a device that's expecting latin1, it will be garbled. What do did you expect?

You say in your edit: """I get the same error even if I uncomment the Latin-1 line""". This is very unlikely -- much more likely is that you got a slightly different error (mentioning the latin1 codec instead of the ascii codec) in a different source line (the uncommented latin1 line instead of the writerow line). Reading error messages carefully aids understanding.

Your problem here is that in general your data is NOT encodable in latin1; very little real-world data is. Your POUND SIGN is encodable in latin1, but that's not all your non-ASCII data. The problematic character is U+2022 BULLET which is not encodable in latin1.

It would have helped you get a better answer sooner if you had mentioned up front that you were working on Mac OS X ... the usual suspect for a CSV-suitable encoding is cp1252 (Windows), not mac-roman.

John Machin
A: 
x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)

Look closely: You got a Unicode*Encode*Error calling the decode method.

The reason for this is that decode is intended to convert from a byte sequence (str) to a unicode object. But, as John said, xlrd already uses Unicode strings, so x is already a unicode object.

In this situation, Python 2.x assumes that you meant to decode a str object, so it "helpfully" creates one for you. But in order to convert a unicode to a str, it needs an encoding, and chooses ASCII because it's the lowest common denominator of character encodings. Your code effectively gets interpreted as

x = x.encode('ascii').decode("ISO-8859-1")

which fails because x contains a non-ASCII character.

Since x is already a unicode object, the decode is unnecessary. However, now you run into the problem that the Python 2.x csv module doesn't support Unicode. You have to convert your data to str objects.

for item in items:
    item = [x.encode('latin-1') for x in item]
    cleancsv.writerow(item)

This would be correct, except that you have the character (U+2022 BULLET) in your data, and Latin-1 can't represent it. There are several ways around this problem:

  • Write x.encode('latin-1', 'ignore') to remove the bullet (or other non-Latin-1 characters).
  • Write x.encode('latin-1', 'replace') to replace the bullet with a question mark.
  • Replace the bullets with a Latin-1 character like * or ·.
  • Use a character encoding that does contain all the characters you need.

These days, UTF-8 is widely supported, so there is little reason to use any other encoding for text files.

dan04