ansaurus

Question

UnicodeEncodeError: 'latin-1' codec can't encode character

Answer 1

+2 A:

You are trying to store the Unicode codepoint \u201c using an encoding (ISO-8859-1 / Latin-1) that can't represent that codepoint. Either alter the database to use UTF-8 and store the string data in that encoding, or sanitise your inputs before storing the content, e.g. by following something like Sam Ruby's excellent i18n guide. It talks about the issues that windows-1252 can cause and suggests how to process such input, and it links to sample code!
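As a rough illustration of the "sanitise your inputs" option, here is a minimal Python 3 sketch that swaps a few typographic quotes for ASCII equivalents before encoding to Latin-1. The `ASCII_FALLBACKS` table and the `to_latin1_safe` helper are hypothetical names made up for this example; this is not the method from the guide mentioned above, and the table covers only four characters.

```python
# Hypothetical fallback table: typographic quotes -> plain ASCII quotes.
# Extend it with whatever characters actually show up in your input.
ASCII_FALLBACKS = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
})


def to_latin1_safe(text):
    """Replace the characters listed in ASCII_FALLBACKS; leave the rest alone."""
    return text.translate(ASCII_FALLBACKS)


cleaned = to_latin1_safe("He said \u201cHello\u201d")
encoded = cleaned.encode("latin-1")  # no UnicodeEncodeError for this input
```

Any character outside the table that Latin-1 lacks would still raise `UnicodeEncodeError`, so this only helps when the set of offending characters is small and known.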

jabley 2010-10-15 14:02:35

Answer 2

A:

Latin-1 (aka ISO 8859-1) is a single octet character encoding scheme, and you can't fit \u201c (“) into a byte.

Did you mean to use UTF-8 encoding?

msw 2010-10-15 14:03:25

Latin-1 encodes _specific_ Unicode characters, just not that one. It doesn't matter whether \u201c can fit in a byte. windows-1252 is also a single-octet encoding scheme, and it _does_ include \u201c.

Mark Tolonen 2010-10-15 18:21:14

cp1253 (aka windows-1253) is also a single octet character encoding scheme, and yet `\u0391` fits fine in a byte (specifically, byte 193). You *might* want to take a look at [that](http://stackoverflow.com/questions/368805/python-unicodedecodeerror-am-i-misunderstanding-encode/370199#370199); people have found it helpful.
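The point made in these comments can be checked directly. The following Python 3 sketch (the `can_encode` helper is a hypothetical name introduced here) shows that whether a character "fits in a byte" depends on the encoding's table, not on the size of its codepoint number.

```python
alpha = "\u0391"  # GREEK CAPITAL LETTER ALPHA
quote = "\u201c"  # LEFT DOUBLE QUOTATION MARK

# Both characters have codepoints above 0xFF, yet each becomes a single
# byte in a single-octet encoding whose table includes it.
alpha_cp1253 = alpha.encode("cp1253")
quote_cp1252 = quote.encode("cp1252")


def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


quote_in_latin1 = can_encode(quote, "latin-1")
alpha_in_latin1 = can_encode(alpha, "latin-1")
```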

ΤΖΩΤΖΙΟΥ 2010-10-15 19:30:21

Unicode incorporates the Latin-1/cp1253 glyphs as 16-bit codepoints. I'm surprised that the comments seem to be claiming the converse.

msw 2010-10-16 04:11:13

Answer 3

+1 A:

I hope your database is at least UTF-8. Then you need to call yourstring.encode('utf-8') before you try putting it into the database.

knitti 2010-10-15 14:14:23

Answer 4

+6 A:

Character U+201C Left Double Quotation Mark is not present in the Latin-1 (ISO-8859-1) encoding.

It is present in code page 1252 (Western European). This is a Windows-specific encoding that is based on ISO-8859-1 but which puts extra characters into the range 0x80-0x9F. Code page 1252 is often confused with ISO-8859-1, and it's an annoying but now-standard web browser behaviour that if you serve your pages as ISO-8859-1, the browser will treat them as cp1252 instead. However, they really are two distinct encodings:

>>> u'He said \u201CHello\u201D'.encode('iso-8859-1')
UnicodeEncodeError
>>> u'He said \u201CHello\u201D'.encode('cp1252')
'He said \x93Hello\x94'

If you are using your database only as a byte store, you can use cp1252 to encode “ and other characters present in the Windows Western code page. But still other Unicode characters which are not present in cp1252 will cause errors.

You can use encode(..., 'ignore') to suppress the errors by getting rid of the characters, but really in this century you should be using UTF-8 in both your database and your pages. This encoding allows any character to be used. You should also ideally tell MySQL you are using UTF-8 strings (by setting the database connection and the collation on string columns), so it can get case-insensitive comparison and sorting right.
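For comparison, here is a small Python 3 sketch of the options described above: the `'ignore'` and `'replace'` error handlers versus encoding as UTF-8. It only shows the byte-level results; the database connection and collation settings are not covered, and the variable names are illustrative.

```python
text = "He said \u201cHello\u201d"

# 'ignore' silently drops characters that Latin-1 cannot represent.
ignored = text.encode("latin-1", "ignore")

# 'replace' substitutes a question mark for each such character.
replaced = text.encode("latin-1", "replace")

# UTF-8 can represent every character, so nothing is lost.
utf8 = text.encode("utf-8")
round_trip = utf8.decode("utf-8")
```

Both error handlers lose information for good, while the UTF-8 bytes decode back to the original string.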

bobince 2010-10-15 14:22:20

Isn't `cp1252` a strict superset of ISO-8859-1? I.e. when browsers receive an ISO-8859-1 page, they can render it as if it was CP1252 because there won't be any characters from the range `0x80-0x9F` anyway.

MSalters 2010-10-15 14:45:20

No, the bytes 0x80–0x9F do have real assignments in ISO-8859-1, which are overridden by cp1252's additions, so it's not a superset. They map exactly to the Unicode characters U+0080–U+009F, which are a selection of control characters. These control characters aren't used very much, which is why browsers got away with it, but it's annoying when you are trying to convert a sequence of bytes-as-Unicode.
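A short Python 3 sketch of the difference, using Python's built-in `latin-1` and `cp1252` codecs. The final step, re-encoding as Latin-1 and decoding as cp1252, is one common way to repair text that was decoded with the wrong one of the two; it is shown as an assumption about a typical case, not a general fix.

```python
raw = bytes(range(0x80, 0xA0))

# In Latin-1 every byte maps to the codepoint with the same number,
# so 0x80-0x9F become the control characters U+0080-U+009F.
as_latin1 = raw.decode("latin-1")
latin1_codepoints = [ord(ch) for ch in as_latin1]

# cp1252 reassigns most of that range to printable characters.
quote_cp1252 = b"\x93".decode("cp1252")
quote_latin1 = b"\x93".decode("latin-1")

# Text that was really cp1252 but got decoded as Latin-1 can be repaired
# by reversing the wrong decode and applying the right one.
mis_decoded = b"\x93Hello\x94".decode("latin-1")
repaired = mis_decoded.encode("latin-1").decode("cp1252")
```

Note that Python's `cp1252` codec leaves a few bytes in this range (such as 0x81) undefined, so decoding arbitrary bytes with it can raise `UnicodeDecodeError`.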

bobince 2010-10-15 15:03:16

@bobince: The only time that I've ever seen characters in the range U+0080-U+009F in a file encoded as ISO-8859-1 or UTF-8 resulted from some clown concatenating a bunch of files some of which were encoded in cp850 and then transcoding the resultant mess from "latin1" to UTF-8. The draft HTML5 spec is considering sanctifying that very practical browser behaviour (and a whole bunch of similar cases) -- see http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0

John Machin 2010-10-18 23:39:40
