ansaurus

Question

Answer 1

A:

Try using "ISO-8859-1" as your encoding. It seems like you are dealing with extended ASCII, not Unicode.

Edit:

Here's some simple code that deals with extended ASCII:

>>> s = "La Pe\xf1a"
>>> print s
La Pe±a
>>> print s.decode("latin-1")
La Peña
>>>

Even better, dealing with the exact character that is giving you problems:

>>> s = "12\xa3"
>>> print s.decode("latin-1")
12£
>>>

Stargazer712 2010-08-13 19:10:06

Do you mean use: yield [unicode(cell, 'ISO-8859-1') for cell in row] instead, in the unicode_csv_reader function? Unfortunately that doesn't help - back to the ordinal not in range(128) error again.

AP257 2010-08-13 19:18:55

It wouldn't make much sense to use a function called unicode() when dealing with ASCII. What I am saying is that you are dealing with a file that is encoded using the "ISO-8859-1" encoding. I didn't post any code, because I don't know how to do it off the top of my head, but your problem is that you need to decode it as ISO-8859-1, not Unicode.

Stargazer712 2010-08-13 19:21:59
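
For readers following along, a minimal sketch of decoding a CSV file as ISO-8859-1 might look like the following. It assumes Python 3 (the snippets in this thread are Python 2), where the csv module works on text, so the bytes are decoded before parsing. The helper name `read_latin1_csv` and the sample data are made up for illustration; this is not the OP's unicode_csv_reader.

```python
import csv
import io


def read_latin1_csv(raw_bytes, encoding="latin-1"):
    """Parse CSV data held in a byte string, decoding it with `encoding`.

    Hypothetical helper for illustration only.
    """
    # Wrap the bytes in a text layer so csv.reader receives decoded str rows.
    text_stream = io.TextIOWrapper(io.BytesIO(raw_bytes),
                                   encoding=encoding, newline="")
    return list(csv.reader(text_stream))


# Made-up sample: the pound sign is stored as the single byte 0xA3.
sample = b"item,price\r\ntea,12\xa3\r\n"
rows = read_latin1_csv(sample)

# ascii() escapes non-ASCII characters, so the output survives any console.
print(ascii(rows))
```

With a real file, the same idea would presumably be `open(path, encoding="latin-1", newline="")` passed straight to `csv.reader`.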

OK, thanks. I'll investigate. How did you know it was ISO-8859-1? In other words, is there a way for me to check encodings myself, rather than just ask dumb questions on StackOverflow :)

AP257 2010-08-13 19:24:10

Not a dumb question at all. I once worked on a project building a web scraping tool, and we needed to scrape international sites. I spent two full weeks immersing myself in the intricate details of encodings, and to this day I am one of the few at my workplace who has a firm grasp of them.

Stargazer712 2010-08-13 19:25:57

The code works - thank you :)

AP257 2010-08-13 19:31:47

@Stargazer: (1) UTF-8 is not Unicode. (2) ISO-8859-n maps `\xa3` to `U+00A3 POUND SIGN` for n in (1, 3, 7, 8, 9, 13, 14, 15). Please answer the OP's question: How did you "know" it was ISO-8859-1?

John Machin 2010-08-13 22:05:41
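
The claim in point (2) can be checked against the codecs that ship with Python. The sketch below assumes Python 3 and simply tries each ISO-8859 part from 1 to 16, collecting those that decode the byte `\xa3` to U+00A3; the variable names are illustrative.

```python
POUND = "\u00a3"  # U+00A3 POUND SIGN

matches = []
for n in range(1, 17):
    try:
        decoded = b"\xa3".decode("iso8859-%d" % n)
    except (LookupError, UnicodeDecodeError):
        # Either there is no such codec, or the byte is unassigned in that part.
        continue
    if decoded == POUND:
        matches.append(n)

print(matches)
```

Several parts agreeing on this one byte is exactly why a single pound sign cannot identify the encoding on its own.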

@John Machin: (1) - I don't really care. (2) - The character being larger than 127 implies that it is not ASCII, and the fact that it is not decoding as Unicode or UTF-8 implies that it is most likely some form of extended ASCII. From personal experience, ISO-8859-1 is one of the most popular encodings among those who speak Western European languages (English, Spanish, French, German, etc.). How did I "know"? I didn't. I went with what was most likely, which worked just fine.

Stargazer712 2010-08-16 15:36:05
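
The elimination described in that comment can be sketched as a loop over candidate encodings. This assumes Python 3; the function name `first_working_encoding` and the candidate order are assumptions made for illustration, not anything posted in the thread.

```python
def first_working_encoding(raw_bytes, candidates=("ascii", "utf-8", "latin-1")):
    """Return (encoding, text) for the first candidate that decodes cleanly."""
    for name in candidates:
        try:
            return name, raw_bytes.decode(name)
        except UnicodeDecodeError:
            continue
    return None, None


encoding, text = first_working_encoding(b"12\xa3")

# ascii() escapes the pound sign so the output is safe on any console.
print(encoding, ascii(text))
```

Note that latin-1 assigns a character to every byte value, so it never raises. Reaching it only shows that the stricter candidates failed, not that latin-1 is the right answer.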

John Machin 2010-08-16 22:34:00

Answer 2

A:

If you are on Windows, it is highly likely that the encoding that you should use is one of the cp125X family ... e.g. if you are in Western Europe or the Americas, it will be cp1252. Windows software often uses bytes in the range \x80 to \x9F inclusive to encode fancy punctuation characters whereas that range is reserved in ISO-8859-X for the rarely used "C1 Control Characters".
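
A small sketch of that difference, assuming Python 3 and a made-up byte string containing Windows-style curly quotes, a dash, and a pound sign:

```python
# Made-up sample: 0x93/0x94 are curly quotes and 0x96 is an en dash in cp1252.
raw = b"\x93quoted\x94 \x96 12\xa3"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")

# ascii() escapes non-ASCII characters so the two results can be compared by eye.
print(ascii(as_cp1252))  # punctuation code points (U+201C, U+201D, U+2013)
print(ascii(as_latin1))  # C1 control code points (U+0093, U+0094, U+0096)
```

Both decodes succeed and both agree on the pound sign, so the wrong choice only shows up later as invisible control characters where the punctuation should be.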

You can find out the usual encoding in your locale by running this at the command line:

python -c "import locale; print locale.getpreferredencoding()"

John Machin 2010-08-13 21:52:02

He is having difficulty reading £ signs, and you're assuming that the file was conveniently saved on whatever settings *his* computer prefers? I would be careful making the assumption that the file is something that was saved using his machine.

Stargazer712 2010-08-16 15:40:06

@Stargazer712: No, I'm not assuming anything. I'm suggesting that it is highly likely that the file was created on a machine in the same locale and using the same operating system as the machine the OP is using.

John Machin 2010-08-16 22:01:00

@John: My experience with encodings (as I mentioned earlier) came from scraping the web. I assure you it is not a safe assumption.

Stargazer712 2010-08-17 04:41:52

@Stargazer712: Which part of "I'm not assuming anything" don't you understand? I'm suggesting that the OP should check whether cp125X might not be more appropriate, i.e. more future-proof.

John Machin 2010-08-17 05:05:59

@John: "I'm suggesting that it is highly likely that the file was created on a machine in the same locale..." -- That's an assumption, and I'm done talking about this.

Stargazer712 2010-08-17 14:42:34


Python csv: UnicodeDecodeError
