You asked (in a comment) """That is what's puzzling me. How did it go from it original accented to what it is now? When you say double encoding with utf8 and latin1, is that a total of 3 encodings(2 utf8 + 1 latin1)? What's the order of the encode from the original state to the current one?"""
In the answer by Mark Byers, he says """what you have seems to be a UTF-8 encoding that has been incorrectly decoded""". You have accepted his answer. But you are still puzzled? OK, here's the blow-by-blow description:
Note: All strings will be displayed using (implicitly) repr()
. unicodedata.name()
will be used to verify the contents. That way, variations in console encoding cannot confuse interpretation of the strings.
Initial state: you have a unicode object that you have named u1. It contains e-acute:
>>> u1 = u'\xe9'
>>> import unicodedata as ucd
>>> ucd.name(u1)
'LATIN SMALL LETTER E WITH ACUTE'
You encode u1 as UTF-8 and name the result s:
>>> s = u1.encode('utf8')
>>> s
'\xc3\xa9'
You decode s using latin1 -- INCORRECTLY; s was encoded using utf8, NOT latin1. The result is meaningless rubbish.
>>> u2 = s.decode('latin1')
>>> u2
u'\xc3\xa9'
>>> ucd.name(u2[0]); ucd.name(u2[1])
'LATIN CAPITAL LETTER A WITH TILDE'
'COPYRIGHT SIGN'
>>>
Please understand: unicode_object.encode('x').decode('y)
when x != y is normally [see note below] a nonsense; it will raise an exception if you are lucky; if you are unlucky it will silently create gibberish. Also please understand that silently creating gibberish is not a bug -- there is no general way that Python (or any other language) can detect that a nonsense has been committed. This applies particularly when latin1 is involved, because all 256 codepoints map 1 to 1 with the first 256 Unicode codepoints, so it is impossible to get a UnicodeDecodeError from str_object.decode('latin1').
Of course, abnormally (one hopes that it's abnormal) you may need to reverse out such a nonsense by doing gibberish_unicode_object.encode('y').decode('x')
as suggested in various answers to your question.