I want to convert a number of Unicode code points read from a file to their UTF-8 encoding.

For example, I want to convert the string 'FD9B' to the string 'EFB69B'.

I can do this manually using string literals like this:

u'\uFD9B'.encode('utf-8')

but I cannot work out how to do it programmatically.

+2  A: 

Use the built-in function unichr() to convert the number to a character, then encode that:

>>> unichr(int('fd9b', 16)).encode('utf-8')
'\xef\xb6\x9b'

This is the encoded byte string itself. If you want the bytes as ASCII hex, you need to walk through the string and convert each character c to hex, using hex(ord(c)) or similar.
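
The hex-conversion step can be sketched as follows. This is a minimal sketch that assumes Python 3, where chr() takes the place of unichr() and iterating over a bytes object yields integers, so ord() is not needed. The function name codepoint_hex_to_utf8_hex is hypothetical, and '%02X' is used in place of hex() so that each byte comes out as exactly two uppercase digits with no '0x' prefix.

```python
def codepoint_hex_to_utf8_hex(codepoint_hex):
    """Convert a hex code point such as 'FD9B' to its UTF-8 bytes as uppercase hex."""
    # Parse the hex digits into an integer, then into a one-character string.
    char = chr(int(codepoint_hex, 16))  # chr() is Python 3's counterpart of unichr()
    utf8_bytes = char.encode('utf-8')
    # Format each byte as two uppercase hex digits and join them.
    return ''.join('%02X' % b for b in utf8_bytes)


print(codepoint_hex_to_utf8_hex('FD9B'))  # EFB69B
```

For the value in the question this yields 'EFB69B', which matches the requested output.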

unwind
The output is not as specified by the question. Anyway, if the OP is happy…
ΤΖΩΤΖΙΟΥ
+1  A: 
Python 2.6.2 (r262:71600, Apr 16 2009, 09:17:39) 
[GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\uFD9B'.encode('utf-8')
'\xef\xb6\x9b'
>>> s = 'FD9B'
>>> i = int(s, 16)
>>> i
64923
>>> unichr(i)
u'\ufd9b'
>>> _.encode('utf-8')
'\xef\xb6\x9b'
Virgil Dupras
A: 
# In a Python 2 byte string, '\uFD9B' is the six literal characters \ u F D 9 B.
data_from_file = '\uFD9B'
unicode(data_from_file, "unicode_escape").encode("utf8")
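
For comparison, the same unicode_escape idea might look like this in Python 3. This is only a sketch under the assumption that the text read from the file already carries the literal \u prefix, as in the snippet above. The function name escape_to_utf8 is hypothetical.

```python
def escape_to_utf8(text):
    """Turn text containing literal \\uXXXX escapes into UTF-8 bytes."""
    # The escapes are plain ASCII, so encode to bytes first, then let the
    # unicode_escape codec interpret the \uXXXX sequences.
    decoded = text.encode('ascii').decode('unicode_escape')
    return decoded.encode('utf-8')


data_from_file = r'\uFD9B'  # raw string: a backslash followed by uFD9B
print(escape_to_utf8(data_from_file))  # b'\xef\xb6\x9b'
```

If the file holds bare hex digits such as 'FD9B', the prefix would have to be added before decoding.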
pixelbeat
A: 

If the input string length is a multiple of 4 (i.e. your Unicode code points are UCS-2 encoded, four hex digits each), then try this:

import struct

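def unihex2utf8hex(arg):
    # Each UCS-2 code unit is four hex digits.
    count = len(arg) // 4
    # Unpack the raw bytes as big-endian unsigned 16-bit integers.
    uniarr = struct.unpack('!%dH' % count, arg.decode('hex'))
    return u''.join(map(unichr, uniarr)).encode('utf-8').encode('hex')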

>>> unihex2utf8hex('fd9b')
'efb69b'
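
A rough Python 3 counterpart is sketched below. It swaps the struct/unichr approach for the 'utf-16-be' codec, which reads the same big-endian 16-bit units and also combines surrogate pairs. It assumes the input is well-formed hex whose length is a multiple of 4. The function name utf16_hex_to_utf8_hex is hypothetical.

```python
def utf16_hex_to_utf8_hex(arg):
    """Convert hex of big-endian 16-bit code units to hex of the UTF-8 bytes."""
    # bytes.fromhex() replaces Python 2's arg.decode('hex').
    raw = bytes.fromhex(arg)
    # Decode the 16-bit units as text, then re-encode as UTF-8.
    text = raw.decode('utf-16-be')
    # bytes.hex() replaces Python 2's .encode('hex') and returns lowercase hex.
    return text.encode('utf-8').hex()


print(utf16_hex_to_utf8_hex('fd9b'))      # efb69b
print(utf16_hex_to_utf8_hex('fd9b0041'))  # efb69b41
```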
ΤΖΩΤΖΙΟΥ