
Question

Answer 1

+5 A:

import unicodedata as ud

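# Decomposed form: base letter followed by a combining mark (two code points).
astr = u"\N{LATIN SMALL LETTER E}" + u"\N{COMBINING ACUTE ACCENT}"
# NFC composes the pair into a single precomposed code point where one exists.
combined_astr = ud.normalize('NFC', astr)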

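'NFC' tells ud.normalize to apply canonical decomposition (as 'NFD' does) and then canonical composition, which merges each base character and its combining marks into a precomposed character wherever one exists: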

print(ud.name(combined_astr))
# LATIN SMALL LETTER E WITH ACUTE
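
As a rough sketch of the same idea, the round trip between the two forms can be checked directly. This assumes Python 3, and the helper `names` is made up for illustration; it is needed because ud.name accepts only a single character, so it would fail on the uncombined two-character string.

```python
import unicodedata as ud

def names(s):
    # ud.name() accepts exactly one character, so name each code point in turn.
    return [ud.name(ch) for ch in s]

decomposed = "e\N{COMBINING ACUTE ACCENT}"
composed = ud.normalize("NFC", decomposed)

print(names(decomposed))
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
print(names(composed))
# ['LATIN SMALL LETTER E WITH ACUTE']

# NFD reverses the composition.
print(ud.normalize("NFD", composed) == decomposed)
# True

# NFC only composes when a precomposed character exists; "q" plus an
# acute accent has none, so the string stays two code points long.
print(len(ud.normalize("NFC", "q\N{COMBINING ACUTE ACCENT}")))
# 2
```

The last case is worth keeping in mind: NFC gives the shortest canonical form available, which is not always a single code point.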

Both strings look the same when printed:

print(astr)
# é
print(combined_astr)
# é

But their reprs are different (output from Python 2):

print(repr(astr))
# u'e\u0301'
print(repr(combined_astr))
# u'\xe9'
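
In Python 3, repr() leaves printable non-ASCII characters unescaped, so the two strings would look alike there; the built-in ascii() exposes the escapes instead. A short sketch, assuming Python 3, that also checks length and equality:

```python
import unicodedata as ud

astr = "e\u0301"                          # e + COMBINING ACUTE ACCENT
combined_astr = ud.normalize("NFC", astr)

print(ascii(astr))
# 'e\u0301'
print(ascii(combined_astr))
# '\xe9'

# The two strings differ in length and do not compare equal.
print(len(astr), len(combined_astr))
# 2 1
print(astr == combined_astr)
# False
```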

Their encodings, in UTF-8 for example, are (not surprisingly) different too:

print(repr(astr.encode('utf_8')))
# 'e\xcc\x81'
print(repr(combined_astr.encode('utf_8')))
# '\xc3\xa9'
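
Because the byte sequences differ, it can help to normalize text to one form before encoding or comparing it. A minimal sketch, assuming Python 3; `to_nfc_bytes` is a hypothetical helper, not part of unicodedata:

```python
import unicodedata as ud

def to_nfc_bytes(text, encoding="utf-8"):
    # Normalize to NFC first so canonically equivalent inputs yield identical bytes.
    return ud.normalize("NFC", text).encode(encoding)

decomposed = "e\u0301"   # two code points
composed = "\xe9"        # one precomposed code point

print(decomposed.encode("utf-8"), composed.encode("utf-8"))
# b'e\xcc\x81' b'\xc3\xa9'
print(to_nfc_bytes(decomposed) == to_nfc_bytes(composed))
# True
```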

unutbu 2010-10-02 13:10:50

From your repr examples this looks exactly like what I need. Thank you for taking the time to reply! Answer accepted.

andreb 2010-10-02 14:06:49

Precompose Unicode Character Sequences in Python