ansaurus

Question

Python UTF-8 can't decode byte on 32-bit machine

Answer 1

A:

"Invalid Data" usually means that the incoming data contains byte sequences that are not valid in the character set it is being decoded as.

This is often caused by some data having been encoded, at some point, in a character set other than UTF-8.

For example, the file a string is stored in may not have been converted to UTF-8 when you made UTF-8 the standard character set. (On Windows, you can usually specify a file's encoding in the "Save as..." dialog of your text editor.)

Or the data may come from a database that uses a different character set in the tables, the connection, or both.

Check out where the data comes from, and what encodings are set along the way.
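A minimal sketch of how this kind of mismatch shows up, written for Python 3 (the thread itself uses Python 2). The sample text is hypothetical and only stands in for data that was saved in cp1252 but is then read as UTF-8.

```python
# Hypothetical sample: "café" stored as cp1252 instead of UTF-8.
raw = "café".encode("cp1252")

# Decoding with the wrong codec fails on the first non-ASCII byte.
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The exception records where the offending byte sits.
bad_byte = raw[error.start:error.start + 1]

# Decoding with the codec the data was really written in succeeds.
text = raw.decode("cp1252")
```

The position stored on the exception is usually the quickest pointer to which part of the pipeline produced the non-UTF-8 bytes.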

Pekka 2010-04-01 18:39:38

Answer 2

+3 A:

We need the following, and we need the exact output:

type(sourceresult.sourcename) # I suspect it's already a UTF-8 encoded string

repr(sourceresult.sourcename)

Like I said, I'm almost certain that your sourceresult.sourcename is already a UTF-8 encoded string.

Perhaps this might help a little.

EDIT: it seems your sourceresult.sourcename is encoded as cp1252. I don't know what mystring (that you reference in a comment) is. So, to get a UTF-8 encoded string, you need to do:

source_as_UTF8 = sourceresult.sourcename.decode("cp1252").encode("utf-8")

However, the string being cp1252-encoded is not consistent with the error message you supplied.
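For readers on Python 3, the conversion above can be sketched as follows. The byte string is an assumption based on the value discussed in this thread; in Python 3 the same two steps apply, but the input must be a bytes object.

```python
# Assumed input: the value from this thread as cp1252-encoded bytes.
raw = b"kazamidori blog\xf9"

# Step 1: bytes -> text, using the encoding the data is really in.
text = raw.decode("cp1252")

# Step 2: text -> UTF-8 encoded bytes.
source_as_utf8 = text.encode("utf-8")
```

Here the single byte 0xf9 becomes the two-byte UTF-8 sequence for "ù".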

ΤΖΩΤΖΙΟΥ 2010-04-01 19:00:03

this is the repr: 'kazamidori blog\xf9'. this is the type: <type 'str'>. is there any way to find out what kind of string it is?

JiminyCricket 2010-04-01 19:16:55

assuming that it was already UTF8, i tried mystring.decode('utf8','replace'), but that only returned the first character of the string

JiminyCricket 2010-04-01 19:18:11

i was able to fix it by doing (sourceresult.sourcename).decode('cp1252').encode('utf8'). how were you able to tell that it was cp1252?

JiminyCricket 2010-04-01 19:36:22

Because it's the "Windows Western" encoding, and thus the safest bet :) It also helped that the resulting "kazamidori blogù" has hits in Google. BTW, whenever you find that an answer is the one that solves your problem, you should click the checkmark (✓) under the answer's vote count.

ΤΖΩΤΖΙΟΥ 2010-04-01 21:08:41

+1 Well spotted, ΤΖΩΤΖΙΟΥ. A wise man once said "If the encoding of some data is stated to be unknown or ISO-8859-1, it is in fact cp1252".
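A small Python 3 sketch of why that rule of thumb matters. The sample bytes are hypothetical. The two encodings agree on a byte such as 0xf9 but differ in the 0x80-0x9f range, where cp1252 has printable characters and ISO-8859-1 has control codes.

```python
# Hypothetical sample using the Windows "curly quote" bytes.
raw = b"\x93quoted\x94"

as_cp1252 = raw.decode("cp1252")    # curly quotation marks
as_latin1 = raw.decode("latin-1")   # invisible C1 control characters

# For 0xf9, the byte from this thread, both codecs give the same result.
same_for_f9 = b"\xf9".decode("cp1252") == b"\xf9".decode("latin-1")
```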

John Machin 2010-04-01 21:54:38

thanks, good to know. i wanted to vote your post up, but i don't have enough reputation yet =(

JiminyCricket 2010-04-01 22:02:24

Answer 3

A:

I think the problem is with your use of the str() function. Keep in mind that str() returns narrow, i.e. 1-byte-per-character strings. If the input, sourceresult.sourcename, is unicode, then Python automatically encodes it in order to return a narrow string. By default it uses the encoding reported by sys.getdefaultencoding(), which in Python 2 is normally ASCII, to do this.

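So you're getting the error because it doesn't make sense to call encode on a string that is already encoded: Python 2 first has to decode it back to unicode with the default codec, and that implicit decode fails on any non-ASCII byte. If you get rid of the str(), it should work.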
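In Python 2, calling encode on a byte string triggers an implicit decode with the default ASCII codec before the requested encode runs. The sketch below spells that hidden step out in Python 3, where it has to be written explicitly. The sample value is an assumption taken from the string discussed in this thread.

```python
# Assumed input: an already-encoded (cp1252) byte string.
encoded = "kazamidori blog\u00f9".encode("cp1252")

# Python 2 performed the .decode("ascii") step silently;
# it is written out here to show where the error comes from.
try:
    encoded.decode("ascii").encode("utf-8")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)
```

That is why an encode call can end in a "can't decode byte" error: the failure is in the implicit decode, not in the encode itself.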

DNS 2010-04-01 19:04:44

hmm, good thought, but removing str() didn't work

JiminyCricket 2010-04-01 19:09:59

Yeah, my answer is only applicable if, as you originally said, the source string is unicode. If it, as it now appears, isn't, then you'll need to figure out what the database is encoding it to before I can suggest anything.

DNS 2010-04-01 19:21:30

yup, sorry for the confusion. i thought it was unicode. the main problem here is that the data isn't in a standard encoding, i guess. i was able to fix it by doing (sourceresult.sourcename).decode('cp1252').encode('utf8'). this is based on ΤΖΩΤΖΙΟΥ saying that it was cp1252; i'm curious to know how he found that out. will comment on his post.

JiminyCricket 2010-04-01 19:35:58
