Hi,

I have a set of UTF-8 octets and I need to convert them back to Unicode code points. How can I do this in Python?

E.g. the UTF-8 octets ['0xc5','0x81'] should be converted to the code point 0x141.

+4  A: 

I'm assuming pre-3.x...

Put them in a str, and either call unicode with the string and 'utf-8':

>>> unicode('\xc5\x81', 'utf-8')
u'\u0141'

Or call .decode('utf-8') on the str:

>>> '\xc5\x81'.decode('utf-8')
u'\u0141'

If by "octet" you really mean a string in the form '0xc5' (rather than '\xc5'), you can first convert them to a byte str like this, and then decode the result as above:

>>> ''.join(chr(int(x,0)) for x in ['0xc5', '0x81'])
'\xc5\x81'
Laurence Gonsalves
+1, well explained!
S.Mark
+1: Also, int(x,0) has the advantage of permitting mixed-base octet strings. For example, in ['0xc5', '0x81', '0305', '0201'] the hex and octal strings represent the same two octets. int(x,16) would misinterpret the octal strings in this input.
mhawke
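A caveat if the int(x,0) trick is run on Python 3 rather than the 2.x this answer assumes: base 0 follows Python 3 integer-literal syntax, so a bare leading zero such as '0305' is rejected and octal strings need the '0o' prefix. A small sketch of that behaviour (the variable names are only illustrative):

```python
# Python 3: with base 0, each string is parsed like an integer literal.
values = [int(x, 0) for x in ['0xc5', '0x81', '0o305', '0o201']]
print(values)  # [197, 129, 197, 129]

try:
    int('0305', 0)  # Python 2 style octal; not a valid Python 3 literal
    legacy_octal_ok = True
except ValueError:
    legacy_octal_ok = False
print(legacy_octal_ok)  # False
```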
+3  A: 
>>> l = ['0xc5','0x81']
>>> s = ''.join([chr(int(c, 16)) for c in l]).decode('utf8')
>>> s
u'\u0141'
mhawke
+1  A: 
>>> "".join((chr(int(x,16)) for x in ['0xc5','0x81'])).decode("utf8")
u'\u0141'
S.Mark
+1  A: 

In lovely 3.x, where all strs are Unicode, and bytes are what strs used to be:

>>> s = str(bytes([0xc5, 0x81]), 'utf-8')
>>> s
'Ł'
>>> ord(s)
321
>>> hex(ord(s))
'0x141'

Which is what you asked for.

Don O'Donnell
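
For the string form used in the question (['0xc5', '0x81']) on Python 3, the same idea can be sketched as below. This is a minimal sketch, not taken from any of the answers above: the helper name octets_to_codepoints is hypothetical, and it assumes every string is a hex value in the range 0x00 to 0xff.

```python
def octets_to_codepoints(octets):
    """Decode UTF-8 octets given as hex strings such as '0xc5' into code points.

    Hypothetical helper for illustration only.
    """
    # int(x, 16) accepts an optional '0x' prefix; bytes() requires values 0..255.
    raw = bytes(int(x, 16) for x in octets)
    # decode() raises UnicodeDecodeError if the octets are not valid UTF-8.
    text = raw.decode('utf-8')
    # One integer code point per decoded character.
    return [ord(ch) for ch in text]

codepoints = octets_to_codepoints(['0xc5', '0x81'])
print([hex(cp) for cp in codepoints])  # ['0x141']
```

Returning a list rather than a single integer means the same call also works when the octets encode more than one character.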