views: 315
answers: 1
I have ascii strings which contain the character "\x80" to represent the euro symbol:

>>> print "\x80"
€

When inserting string data containing this character into my database, I get:

psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0x80
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

I'm a Unicode newbie. How can I convert my strings containing "\x80" into valid UTF-8 that contains the same euro symbol? I've tried calling .encode and .decode on various strings, but I run into errors:

>>> "\x80".encode("utf-8")
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    "\x80".encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
+2  A: 

The question starts with a false premise: "I have ascii strings which contain the character "\x80" to represent the euro symbol". ASCII characters are in the range "\x00" to "\x7F" inclusive.

The previously-accepted, now-deleted answer operated under two gross misapprehensions: (1) that locale == encoding, and (2) that the latin1 encoding maps "\x80" to a euro character.

In fact, all of the ISO-8859-x encodings map "\x80" to U+0080, which is one of the C1 control characters, not a euro character. Only 3 of those encodings (x in (7, 15, 16)) provide the euro character, and they do so as "\xA4". See the Wikipedia article on ISO/IEC 8859.
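To see this concretely, here is a small demonstration (written with bytes literals so it runs on both Python 2 and 3) that latin1 turns "\x80" into a control character, while ISO-8859-15 carries the euro at "\xA4":

```python
# "latin1" (ISO-8859-1) maps byte 0x80 to U+0080, a C1 control character.
assert b'\x80'.decode('latin1') == u'\u0080'        # NOT the euro sign

# ISO-8859-15 ("latin9") does have the euro, but at 0xA4, not 0x80.
assert b'\xa4'.decode('iso-8859-15') == u'\u20ac'   # EURO SIGN
```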

You need to know what encoding your data is in. What machine was it created on? How? The locale it was created in (not necessarily yours) may give you a clue.

Note that "My data is encoded in latin1" is up there with "The cheque's in the mail" and "Of course I'll love you in the morning". Your data is probably encoded in one of the cp125x encodings found on Windows platforms. Note that all of them except cp1251 (Windows Cyrillic) map "\x80" to the euro character:

>>> ['\x80'.decode('cp125' + str(x), 'replace') for x in range(9)]
[u'\u20ac', u'\u0402', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac']
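If the data does turn out to be one of those cp125x encodings, the repair for the original problem is: decode the byte string to unicode first, then encode that to UTF-8 for the database. A minimal sketch, assuming cp1252 (the sample bytes are illustrative):

```python
raw = b'Mad \x80 here'           # bytes as read from the file
text = raw.decode('cp1252')      # unicode; 0x80 becomes U+20AC (euro)
utf8 = text.encode('utf-8')      # valid UTF-8 for a UTF8-encoded database

assert text == u'Mad \u20ac here'
assert utf8 == b'Mad \xe2\x82\xac here'   # euro is E2 82 AC in UTF-8
```

The error in the question arose because calling .encode("utf-8") on a Python 2 byte string makes Python first decode it implicitly with the ascii codec, which fails on any byte above "\x7F".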

Update in response to the OP's comment """I'm reading this data from a file, e.g. open(fname).read(). It contains strings with \x80 in them that represents the euro character. it's just a plain text file. it is generated by another program, but I don't know how it goes about generating the text. what would be a good solution? I'm thinking I can assume that it outputs "\x80" for a euro character, meaning I can assume it's encoded with a cp125x that has that char as the euro""":

This is a bit confusing: First you say "It contains strings with \x80 in them that represents the euro character" but later you say """I'm thinking I can assume that it outputs "\x80" for a euro character""" -- please explain.

Selecting an appropriate cp125x encoding: Where (geographical location) was the file created? In what language(s) is the text written? Any characters other than the presumed euro with values > "\x7f"? If so, which ones and what context are they used in?

Update 2: If you don't "know how the program is written", neither you nor we can form an opinion on whether it always uses "\x80" for the euro character. Although doing otherwise would be monumental silliness, it can't be ruled out.

If the text is written in the English language and/or it is written in the USA, and/or it's written on a Windows platform, then it's reasonably certain that cp1252 is the way to go ... until you get evidence to the contrary, in which case you'd need to guess an encoding by yourself or answer the (what language, what locality) questions.
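In that case, the cleanest approach is to decode at the point of reading, so the rest of the program handles only unicode. A hypothetical sketch (the file name and contents are made up for illustration):

```python
import codecs
import os
import tempfile

# Fabricate a sample cp1252-encoded file for the demonstration.
fname = os.path.join(tempfile.mkdtemp(), 'data.txt')
with open(fname, 'wb') as f:
    f.write(b'Mad \x80 here')

# Instead of open(fname).read(), decode while reading:
with codecs.open(fname, encoding='cp1252') as f:
    text = f.read()              # unicode; the euro is now U+20AC

assert text == u'Mad \u20ac here'
```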

John Machin
+1 for "You need to *know* what encoding your data is in." You *need* to *know*. +1 for "latin1 [doesn't map] '\x80' to a euro". +1 for finding the real encoding, which I was still looking for.
Thanatos
@Thanatos: "real encoding": cp125x are the usual suspects.
John Machin
yep i'm definitely on one of the cp125x, so it worked on my given computer. i'll hard-code it instead. the accepted answer is correct except for using 'latin1' in that case, yes?
Claudiu
@Claudiu: (1) I don't understand your use of the word "so". (2) No, the currently accepted answer is replete with confusion and error.
John Machin
@John: I mean it happened to work on my machine. maybe it was pure chance. I'm reading this data from a file, e.g. `open(fname).read()`. It contains strings with `\x80` in them that represents the euro character. it's just a plain text file. it is generated by another program, but I don't know how it goes about generating the text. what would be a good solution? I'm thinking I can assume that it outputs `"\x80"` for a euro character, meaning I can assume it's encoded with a cp125x that has that char as the euro.
Claudiu
@AnonymousDriveByDownVoter: Please share your wisdom; explain what you don't like about the answer.
John Machin
@John: Let me try to clarify. I have a text file that the program outputs. I know what the text should look like, e.g. a snippet might be "Mad € here". I notice that when I read the bytes from the file, I get "Mad \x80 here". I don't notice any other non-ASCII characters in the file. I was wondering if I can always assume that the program outputs "\x80" for the euro character, and if so, whether I can just use an encoding that happens to have "\x80" map to the euro character. I don't know how the program itself is written.
Claudiu