ansaurus

Question

How do I reverse Unicode decomposition using Python?

Answer 1

+1 A:

I can't give you a definitive answer because I have never tried this myself, but the standard library has a unicodedata module. Its decomposition() and normalize() functions might help you here.
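As a minimal sketch of what those two functions do (the sample character ç is just an illustration, not taken from the question): decomposition() reports the mapping for a single character, while normalize() applies it to a whole string, and the "NFC" form reverses a canonical decomposition.

```python
import unicodedata

composed = "\u00e7"  # ç as a single precomposed code point

# decomposition() returns the canonical mapping as a string of hex code points.
print(unicodedata.decomposition(composed))  # 0063 0327

# normalize("NFD", ...) applies that mapping; "NFC" reverses it.
decomposed = unicodedata.normalize("NFD", composed)
recomposed = unicodedata.normalize("NFC", decomposed)
print([hex(ord(ch)) for ch in decomposed])  # ['0x63', '0x327']
print(recomposed == composed)  # True
```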

Edit: Make sure that it really is decomposed Unicode. Sometimes characters that can't be expressed directly in an encoding are written in ad hoc ways, such as "a, which a human or some specialized program is meant to read as ä.

unbeknown 2009-01-15 10:18:38

You're right, it's not actually proper decomposed Unicode - see my comment on Rafał Dowgird's answer.

msanders 2009-01-15 10:53:06

Answer 2

+4 A:

I think you are looking for this:

>>> import unicodedata
>>> print(unicodedata.normalize("NFC", "c\u0327"))
ç

Rafał Dowgird 2009-01-15 10:33:47

Yes, this works - assuming I really do have decomposed unicode. Unfortunately it seems I actually have (for example) \u00B8 (cedilla) instead of \u0327 (combining cedilla) in my text. Looks like I will need to either map these chars to their combining equivalent or just strip them entirely. Thanks.

msanders 2009-01-15 10:51:46
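The two options mentioned in the comment above (map the spacing characters to their combining equivalents, or strip them) could be sketched roughly as follows. The table SPACING_TO_COMBINING and both function names are hypothetical, the table is hand-picked and far from complete, and the sketch assumes each spacing mark follows the letter it belongs to.

```python
import unicodedata

# Hypothetical, hand-picked table: spacing diacritic -> combining equivalent.
SPACING_TO_COMBINING = {
    "\u00b8": "\u0327",  # CEDILLA -> COMBINING CEDILLA
    "\u00b4": "\u0301",  # ACUTE ACCENT -> COMBINING ACUTE ACCENT
    "\u00a8": "\u0308",  # DIAERESIS -> COMBINING DIAERESIS
}

def compose_spacing_marks(text):
    """Replace mapped spacing marks with combining marks, then recompose (NFC)."""
    table = str.maketrans(SPACING_TO_COMBINING)
    return unicodedata.normalize("NFC", text.translate(table))

def strip_spacing_marks(text):
    """Alternative: drop the mapped spacing marks entirely."""
    return text.translate(dict.fromkeys(map(ord, SPACING_TO_COMBINING)))

print(compose_spacing_marks("fac\u00b8ade"))  # façade
print(strip_spacing_marks("fac\u00b8ade"))    # facade
```

An explicit table only touches the characters listed in it, so the rest of the text is left exactly as it was.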

Answer 3

+1 A:

Unfortunately it seems I actually have (for example) \u00B8 (cedilla) instead of \u0327 (combining cedilla) in my text.

Eurgh, nasty! You can still do it automatically, though the process wouldn't be entirely lossless as it involves a compatibility decomposition (NFKD).

Normalise U+00B8 to NFKD and you'll get a space followed by U+0327 (the combining cedilla). You could then scan through the string looking for any case of space-followed-by-combining-character, and remove the space. Finally recompose to NFC to put the combining characters onto the previous character instead.

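import unicodedata
s = unicodedata.normalize('NFKD', s)
# Drop a space only when a combining character follows it; the length check avoids an IndexError on a trailing space.
s = ''.join(c for i, c in enumerate(s) if c != ' ' or i + 1 == len(s) or unicodedata.combining(s[i + 1]) == 0)
s = unicodedata.normalize('NFC', s)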

bobince 2009-01-15 14:55:26
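For a worked example of the NFKD approach described in this answer, here is a minimal sketch wrapped in a function. The name recompose_spacing_marks and the sample strings are hypothetical. The last call shows why the process is not lossless: NFKD also rewrites unrelated compatibility characters such as the superscript two.

```python
import unicodedata

def recompose_spacing_marks(text):
    """NFKD, drop spaces that precede a combining mark, then NFC."""
    decomposed = unicodedata.normalize("NFKD", text)
    kept = []
    for i, ch in enumerate(decomposed):
        next_ch = decomposed[i + 1] if i + 1 < len(decomposed) else ""
        # combining() returns 0 for non-combining characters.
        if ch == " " and next_ch and unicodedata.combining(next_ch) != 0:
            continue
        kept.append(ch)
    return unicodedata.normalize("NFC", "".join(kept))

print(recompose_spacing_marks("fac\u00b8ade"))  # façade
print(recompose_spacing_marks("a b "))          # ordinary spaces survive
print(recompose_spacing_marks("x\u00b2"))       # x2: the superscript is lost
```

Note that a space which was deliberately followed by a combining character in the input would be merged as well, so this is best suited to text where that combination only arises from spacing diacritics.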
