tags:
views: 43
answers: 2
I'm looking to standardize some Unicode text in Python. Is there an easy way to get the "denormalized" form of a combining Unicode character? E.g., if I have the sequence u'o\xaf' (i.e. latin small letter o followed by combining macron), how do I get ō (latin small letter o with macron)? It's easy to go the other way:

import unicodedata
o = unicodedata.lookup("LATIN SMALL LETTER O WITH MACRON")  # u'\u014d', i.e. ō
o = unicodedata.normalize('NFD', o)                         # u'o\u0304'
+1  A: 
o = unicodedata.normalize('NFC', o)  # recomposes u'o\u0304' into u'\u014d'
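For example, applied to the NFD output of the question's own snippet:

>>> import unicodedata
>>> unicodedata.normalize('NFC', u'o\u0304')
u'\u014d'

(This does not help with the literal u'o\xaf', though, since U+00AF is not a combining character; see the next answer.)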
Ignacio Vazquez-Abrams
+1  A: 

As I have commented, U+00AF is not a combining macron (it is the spacing character MACRON). But you can convert it into U+0020 U+0304 with an NFKD transform.

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'o\u00af')
u'o \u0304'

Then you could remove the space and get ō with NFC.
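Putting the whole round trip together (a sketch; stripping every space is fine for this two-character example, but on longer text you would only want to drop the space that NFKD inserted):

>>> s = unicodedata.normalize('NFKD', u'o\u00af')  # u'o \u0304'
>>> unicodedata.normalize('NFC', s.replace(u' ', u''))
u'\u014d'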


(Note that NFKD decomposes quite aggressively, so some semantics can be lost: anything that is merely "compatible" will be separated out; see the interpreter checks after the list. E.g.

  • '½' (U+00BD) ↦ '1' '⁄' (U+2044) '2';
  • '²' (U+00B2) ↦ '2';
  • '①' (U+2460) ↦ '1';

etc.)
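
You can check these in the same interpreter session:

>>> unicodedata.normalize('NFKD', u'\u00bd')  # ½
u'1\u20442'
>>> unicodedata.normalize('NFKD', u'\u00b2')  # ²
u'2'
>>> unicodedata.normalize('NFKD', u'\u2460')  # ①
u'1'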

KennyTM
Works like a charm! Thanks - I'd tried NFKD, but I didn't think of re-normalizing afterwards.
Simon7