views:

2333

answers:

4

I'm trying to translate an Excel spreadsheet to CSV using the Python xlrd and csv modules, but am getting hung up on encoding issues. Xlrd produces output from Excel in Unicode, and the CSV module requires UTF-8.

I imaging that this has nothing to do with the xlrd module: everything works fine outputing to stdout or other outputs that don't require a specific encoding.

The worksheet is encoded as UTF-16-LE, according to book.encoding

The simplified version of what I'm doing is:

from xlrd import *
import csv
b = open_workbook('file.xls')
s = b.sheet_by_name('Export')
bc = open('file.csv','w')
bcw = csv.writer(bc,csv.excel,b.encoding)
for row in range(s.nrows):
    this_row = []
    for col in range(s.ncols):
     this_row.append(s.cell_value(row,col))
    bcw.writerow(this_row)

This produces the following error, about 740 lines in:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 5: ordinal not in range(128)

The value is seems to be getting hung up on is "516-777316" -- the text in the original Excel sheet is "516-7773167" (with a 7 on the end)

I'll be the first to admit that I have only a vague sense of how character encoding works, so most of what I've tried so far are various fumbling permutations of .encode and .decode on the s.cell_value(row,col)

If someone could suggest a solution I would appreciate it -- even better if you could provide an explanation of what's not working and why, so that I can more easily debug these problems myself in the future.

Thanks in advance!

EDIT:

Thanks for the comments so far.

When I user this_row.append(s.cell(row,col)) (e.g. s.cell instead of s.cell_value) the entire document writes without errors.

The output isn't particularly desirable (text:u'516-7773167'), but it avoids the error even though the offending characters are still in the output.

This makes me think that the challenge might be in xlrd after all.

Thoughts?

A: 

There appear to be two possibilities. One is that you have not perhaps opened the output file correctly:

"If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference." ( http://docs.python.org/library/csv.html#module-csv )

If that is not the problem, then another option for you is to use codecs.EncodedFile(file, input[, output[, errors]]) as a wrapper to output your .csv:

http://docs.python.org/library/codecs.html#module-codecs

This will allow you to have the file object filter from incoming UTF16 to UTF8. While both of them are technically "unicode", the way they encode is very different.

Something like this:

rbc = open('file.csv','w')
bc = codecs.EncodedFile(rbc, "UTF16", "UTF8")
bcw = csv.writer(bc,csv.excel)

may resolve the problem for you, assuming I understood the problem right, and assuming that the error is thrown when writing to the file.

Christopher
Well, it gave a different error message (this time before writing anything to the file):UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 938: truncated data
anschauung
A: 

Looks like you've got 2 problems.

There's something screwed up in that cell - '7' should be encoded as u'x37' I think, since it's within the ASCII-range.

More importantly though, the fact that you're getting an error message specifying that the ascii codec can't be used suggests something's wrong with your encoding into unicode - it thinks you're trying to encode a value 0xed that can't be represented in ASCII, but you said you're trying to represent it in unicode.

I'm not smart enough to work out what particular line is causing the problem - if you edit your question to tell me what line's causing that error message I might be able to help a bit more (I guess it's either this_row.append(s.cell_value(row,col)) or bcw.writerow(this_row), but would appreciate you confirming).

Dominic Rodger
Thanks! The error is on bcw.writerow. Everything outputs correctly if, say, I use print this_row.As best I can tell, there's nothing obviously wrong with the '7' -- it outputs correctly (as u'516-7773167') when I print to stdout.
anschauung
Then it looks like `bcw.writerow` is expecting ASCII - are you sure you've got your arguments to `csv.writer` correct (see http://docs.python.org/library/csv.html#csv.writer)? I'm bemused as to where the `0xed` is coming from.
Dominic Rodger
+3  A: 

I expect the cell_value return value is the unicode string that's giving you problems (please print its type() to confirm that), in which case you should be able to solve it by changing this one line:

this_row.append(s.cell_value(row,col))

to:

this_row.append(s.cell_value(row,col).encode('utf8'))

If cell_value is returning multiple different types, then you need to encode if and only if it's returning a unicode string; so you'd split this line into a few lines:

val = s.cell_value(row, col)
if isinstance(val, unicode):
    val = val.encode('utf8')
this_row.append(val)
Alex Martelli
Perfect! That did it. I suppose I didn't expect that the different value types would need to be treated differently.Thanks!
anschauung
+3  A: 

You asked for explanations, but some of the phenomena are inexplicable without your help.

(A) Strings in XLS files created by Excel 97 onwards are encoded in Latin1 if possible otherwise in UTF16LE. Each string carries a flag telling which was used. Earlier Excels encoded strings according to the user's "codepage". In any case, xlrd produces unicode objects. The file encoding is of interest only when the XLS file has been created by 3rd party software which either omits the codepage or lies about it. See the Unicode section up the front of the xlrd docs.

(B) Unexplained phenomenon:

This code:

bcw = csv.writer(bc,csv.excel,b.encoding)

causes the following error with Python 2.5, 2.6 and 3.1: TypeError: expected at most 2 arguments, got 3 -- this is about what I'd expect given the docs on csv.writer; it's expecting a filelike object followed by either (1) nothing (2) a dialect or (3) one or more formatting parameters. You gave it a dialect, and csv.writer has no encoding argument, so splat. What version of Python are you using? Or did you not copy/paste the script that you actually ran?

(C) Unexplained phenomena around traceback and what the actual offending data was:

"the_script.py", line 40, in <module>
this_row.append(str(s.cell_value(row,col)))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 5: ordinal not in range(128)

FIRSTLY, there's a str() in the offending code line that wasn't in the simplified script -- did you not copy/paste the script that you actually ran? In any case, you shouldn't use str in general -- you won't get the full precision on your floats; just let the csv module convert them.

SECONDLY, you say """The value is seems to be getting hung up on is "516-777316" -- the text in the original Excel sheet is "516-7773167" (with a 7 on the end)""" --- it's difficult to imagine how the 7 gets lost off the end. I'd use something like this to find out exactly what the problematic data was:

try:
    str_value = str(s.cell_value(row, col))
except:
    print "row=%d col=%d cell_value=%r" % (row, col, s.cell_value(row, col))
    raise

That %r saves you from typing cell_value=%s ... repr(s.cell_value(row, col)) ... the repr() produces an unambiguous representation of your data. Learn it. Use it.

How did you arrive at "516-777316"?

THIRDLY, the error message is actually complaining about a unicode character u'\xed' at offset 5 (i.e. the sixth character). U+00ED is LATIN SMALL LETTER I WITH ACUTE, and there's nothing like that at all in "516-7773167"

FOURTHLY, the error location seems to be a moving target -- you said in a comment on one of the solutions: "The error is on bcw.writerow." Huh?

(D) Why you got that error message (with str()): str(a_unicode_object) attempts to convert the unicode object to a str object and in the absence of any encoding information uses ascii, but you have non-ascii data, so splat. Note that your object is to produce a csv file encoded in utf8, but your simplified script doesn't mention utf8 anywhere.

(E) """... s.cell(row,col)) (e.g. s.cell instead of s.cell_value) the entire document writes without errors. The output isn't particularly desirable (text:u'516-7773167')"""

That's happening because the csv writer is calling the __str__ method of your Cell object, and this produces <type>:<repr(value)> which may be useful for debugging but as you say not so great in your csv file.

(F) Alex Martelli's solution is great in that it got you going. However you should read the section on the Cell class in the xlrd docs: types of cell are text, number, boolean, date, error, blank and empty. If you have dates, you are going to want to format them as dates not numbers, so you can't use isinstance() (and you may not want the function call overhead anyway) ... this is what the Cell.ctype attribute and Sheet.cell_type() and Sheet.row_types() methods are for.

(G) UTF8 is not Unicode. UTF16LE is not Unicode. UTF16 is not Unicode ... and the idea that individual strings would waste 2 bytes each on a UTF16 BOM is too preposterous for even MS to contemplate :-)

(H) Further reading (apart from the xlrd docs):

http://www.joelonsoftware.com/articles/Unicode.html
http://www.amk.ca/python/howto/unicode
John Machin
+1: Thanks for the great explanation, and the background links. This has made me realize I can't avoid educating myself on encoding any longer, and I appreciate your going into such detail even after the immediate problem has been resolved.
anschauung