Yes, UTF-8 != Unicode.
UTF-8 is a specific string encoding, as are ASCII and ISO 8859-1. Try this:
For any input string, call inputstring.decode('utf-8') (or whatever input encoding you get). For any output string, call outputstring.encode('utf-8') (or whatever output encoding you want). For any internal use, work with unicode strings ('this is a normal string'.decode('utf-8') == u'this is a normal string').
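A minimal sketch of that decode-on-input, encode-on-output pattern. The variable names and the sample bytes are made up for illustration, and the explicit b'' and u'' prefixes are used only so the snippet reads the same on Python 2.6+ and Python 3.3+:

```python
# Bytes as they might arrive from a file or socket: UTF-8 for "café".
raw_input_bytes = b'caf\xc3\xa9'

# Input boundary: decode the bytes into a unicode string.
text = raw_input_bytes.decode('utf-8')

# Internal use: work on characters, not bytes.
shouted = text.upper()

# Output boundary: encode into whatever encoding the consumer wants.
output_utf8 = shouted.encode('utf-8')
output_latin1 = text.encode('iso-8859-1')
```

The same unicode string ends up as different bytes depending on the output encoding, which is the practical meaning of "UTF-8 != Unicode".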
'foo' is a string, u'foo' is a unicode string, which doesn't "have" an encoding (can't be decoded). So anytime Python wants to change the encoding of a normal string, it first tries to "decode" it, then to "encode" it. And the default encoding for that decode step is "ascii", which fails more often than not :-)
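To see why the "ascii" default bites, here is a rough sketch that does explicitly what Python 2 attempts behind the scenes. The helper name decodes_as_ascii is hypothetical, not part of any library:

```python
def decodes_as_ascii(raw):
    """Return True if the byte string survives a decode with the 'ascii' codec."""
    try:
        raw.decode('ascii')
        return True
    except UnicodeDecodeError:
        # Any byte above 0x7f is not valid ASCII.
        return False

plain = b'foo'               # pure ASCII, decodes fine
accented = b'caf\xc3\xa9'    # UTF-8 bytes for "café", contains non-ASCII bytes

plain_ok = decodes_as_ascii(plain)
accented_ok = decodes_as_ascii(accented)
```

As soon as a normal string contains anything outside ASCII, the implicit decode raises UnicodeDecodeError, so it is safer to decode explicitly with the encoding you actually have.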
knitti
2010-09-19 01:52:12