It works fine on 64-bit machines, but for some reason it will not work on Python 2.4.3 on a 32-bit instance.

I get the error

'utf8' codec can't decode bytes in position 76-79: invalid data

for the code

try:
    str(sourceresult.sourcename).encode('utf8', 'replace')
except:
    raise Exception(repr(sourceresult.sourcename))

It returns 'kazamidori blog\xf9'

I have modified my site.py file to make UTF-8 the default encoding, but it still doesn't seem to be working.

A: 

"Invalid data" usually means that the incoming data contained byte sequences that are not valid in the encoding being used to decode it.

This is often caused by some data having been encoded, at some point, in a character set other than UTF-8.

For example, the file a string is stored in may not have been converted to UTF-8 when you made UTF-8 the standard character set. (On Windows, you can usually specify a file's encoding in the "Save as..." dialog of your text editor.)

Or the data may come from a database that uses a different character set in the tables, the connection, or both.

Check out where the data comes from, and what encodings are set along the way.
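
One quick way to check what you actually have is to try decoding the bytes with a few candidate encodings. This is only a minimal sketch: the byte string is the one from the repr() in the question, while the helper name try_decodings and the list of candidates are made up for illustration. The b'...' literal needs Python 2.6 or later; on 2.4 you would drop the b prefix.

```python
# Byte string taken from the repr() shown in the question.
raw = b'kazamidori blog\xf9'

def try_decodings(data, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return a dict mapping each encoding to the decoded text, or None on failure."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding.
            results[name] = None
    return results

results = try_decodings(raw)
for name in sorted(results):
    print("%s -> %r" % (name, results[name]))
```

Note that a successful decode does not prove the guess is right. latin-1, for instance, accepts every byte value, so you still have to look at the result and see whether it makes sense.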

Pekka
+3  A: 

We need the following, and we need the exact output:

type(sourceresult.sourcename) # I suspect it's already a UTF-8 encoded string

repr(sourceresult.sourcename)

Like I said, I'm almost certain that your sourceresult.sourcename is already a UTF-8 encoded string.

Perhaps this might help a little.

EDIT: it seems your sourceresult.sourcename is encoded as cp1252. I don't know what mystring (that you reference in a comment) is. So, to get a UTF-8 encoded string, you need to do:

source_as_UTF8 = sourceresult.sourcename.decode("cp1252").encode("utf-8")

However, the string being cp1252-encoded is not consistent with the error message you supplied.
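
To see the two steps separately, here is a small sketch that uses the literal bytes from the question in place of sourceresult.sourcename (an assumption, since the real value comes from elsewhere). The b'...' literal needs Python 2.6 or later.

```python
# Stand-in for sourceresult.sourcename: cp1252-encoded bytes from the question.
raw = b'kazamidori blog\xf9'

# Step 1: bytes -> unicode text, interpreting the bytes as cp1252.
text = raw.decode('cp1252')

# Step 2: unicode text -> UTF-8 encoded bytes.
source_as_utf8 = text.encode('utf-8')

print(repr(source_as_utf8))
```

The single byte \xf9 (ù in cp1252) becomes the two bytes \xc3\xb9 in UTF-8.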

ΤΖΩΤΖΙΟΥ
This is the repr: 'kazamidori blog\\xf9'. This is the type: <type 'str'>. Is there any way to find out what type of string it is?
JiminyCricket
Assuming that it was already UTF-8, I tried mystring.decode('utf8','replace'), but that only returned the first character of the string.
JiminyCricket
I was able to fix it by doing (sourceresult.sourcename).decode('cp1252').encode('utf8'). How were you able to tell that it was cp1252?
JiminyCricket
Because it's the "Windows Western" encoding, and thus the safest bet :) It also helped that the resulting "kazamidori blogù" has hits in Google. BTW, whenever you find that an answer is the one that solves your problem, you should click the checkmark (✓) under the answer's vote count.
ΤΖΩΤΖΙΟΥ
+1 Well spotted, ΤΖΩΤΖΙΟΥ. A wise man once said "If the encoding of some data is stated to be unknown or ISO-8859-1, it is in fact cp1252".
John Machin
Thanks, good to know. I wanted to vote your post up, but I don't have enough reputation yet =(
JiminyCricket
A: 

I think the problem is with your use of the str() function. Keep in mind that in Python 2, str() returns narrow, i.e. 1-byte-per-character strings. If the input, sourceresult.sourcename, is unicode, then Python automatically encodes it in order to return a narrow string. It uses the default encoding to do this, which is ASCII unless it has been changed (for example in site.py).

So you're getting the error because it doesn't make sense to call encode on a string that is already encoded: Python 2 first has to decode it back to unicode with the default encoding, and that implicit decode is what fails. If you get rid of the str(), it should work.
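
As far as I understand it, calling .encode() on a byte string in Python 2 is roughly equivalent to the two explicit steps below, which would also explain why passing 'replace' does not help: the error handler only applies to the encode step, not to the hidden decode. This is just a sketch. The helper py2_style_encode is hypothetical, the default encoding is assumed to be UTF-8 as set in the question's site.py, and the syntax needs Python 2.6 or later.

```python
# Byte string from the question.
raw = b'kazamidori blog\xf9'

def py2_style_encode(data, default_encoding='utf-8'):
    """Hypothetical helper spelling out what str.encode() does in Python 2."""
    # Hidden step: decode with the default encoding, always in strict mode.
    text = data.decode(default_encoding)
    # Visible step: the 'replace' error handler only applies here.
    return text.encode('utf-8', 'replace')

try:
    py2_style_encode(raw)
    error = None
except UnicodeDecodeError as exc:
    # The decode step fails because \xf9 is not valid UTF-8.
    error = exc

print(error)
```

Calling py2_style_encode(raw, 'cp1252') instead succeeds, which matches the fix that decodes from cp1252 explicitly.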

DNS
Hmm, good thought, but removing str() didn't work.
JiminyCricket
Yeah, my answer is only applicable if, as you originally said, the source string is unicode. If it, as it now appears, isn't, then you'll need to figure out what the database is encoding it to before I can suggest anything.
DNS
Yup, sorry for the confusion. I thought it was unicode. The main problem here is that the data isn't in a standard encoding, I guess. I was able to fix it by doing (sourceresult.sourcename).decode('cp1252').encode('utf8'). This is based on ΤΖΩΤΖΙΟΥ saying that it was cp1252; I'm curious to know how he found that out. Will comment on his post.
JiminyCricket