Is there any lib that can replace national characters with their ASCII equivalents? For example:

"Cześć"

to:

"Czesc"

I can of course create a map:

{'ś':'s', 'ć': 'c'}

and use some replace function. But I don't want to hardcode all the equivalents into my program if there is already some function that does that.
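That is, something along these lines (just a rough sketch):

PL_MAP = {u'ś': u's', u'ć': u'c'}  # ...and so on for every letter

def asciify(text):
    return u''.join(PL_MAP.get(c, c) for c in text)

print asciify(u'Cześć')  # Czesc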

+3  A: 
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import unicodedata

text = u'Cześć'
# NFD splits each accented letter into base letter + combining mark;
# encoding to ASCII with 'ignore' then drops the combining marks.
print unicodedata.normalize('NFD', text).encode('ascii', 'ignore')
nosklo
'NFKD' would give you ASCII output more often than 'NFD' would.
dan04
+2  A: 

You can get most of the way by doing:

import unicodedata

def strip_accents(text):
    # Decompose, then drop the combining marks (category 'Mn').
    return ''.join(c for c in unicodedata.normalize('NFKD', text)
                   if unicodedata.category(c) != 'Mn')
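For example:

print strip_accents(u'Cześć')  # prints: Czesc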

Unfortunately, there are Latin letters that cannot be decomposed into an ASCII letter plus combining marks; you'll have to handle those manually (see the sketch after the list). They include:

  • Æ → AE
  • Ð → D
  • Ø → O
  • Þ → TH
  • ß → ss
  • æ → ae
  • ð → d
  • ø → o
  • þ → th
  • Œ → OE
  • œ → oe
  • ƒ → f
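Here is a minimal sketch of how the manual cases can be folded in, using a translate table built from the list above (the names asciify and LATIN_FALLBACK are illustrative):

import unicodedata

# Manual replacements for letters with no Unicode decomposition.
LATIN_FALLBACK = {
    u'Æ': u'AE', u'Ð': u'D', u'Ø': u'O', u'Þ': u'TH', u'ß': u'ss',
    u'æ': u'ae', u'ð': u'd', u'ø': u'o', u'þ': u'th',
    u'Œ': u'OE', u'œ': u'oe', u'ƒ': u'f',
}
_TABLE = dict((ord(k), v) for k, v in LATIN_FALLBACK.items())

def asciify(text):
    # Substitute the undecomposable letters first, then strip marks.
    text = text.translate(_TABLE)
    return ''.join(c for c in unicodedata.normalize('NFKD', text)
                   if unicodedata.category(c) != 'Mn')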
dan04
+1  A: 

The unicodedata.normalize gimmick can best be described as half-assci. Here is a robust approach which includes a map for letters with no decomposition. Note the additional map entries in the comments.
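As a sketch of what such an approach can look like (the names make_table and asciify and the exact map entries are illustrative, not the author's original code), the table can be built once from unicodedata.decomposition, with manual entries layered on top:

import unicodedata

# Manual entries for characters whose Unicode decomposition is empty.
# Extend as needed, e.g.:
#   0x0141: u'L',   # Ł  LATIN CAPITAL LETTER L WITH STROKE
#   0x0142: u'l',   # ł  LATIN SMALL LETTER L WITH STROKE
MANUAL = {
    0x00df: u'ss',  # ß
    0x00f8: u'o',   # ø
}

def make_table():
    table = dict(MANUAL)
    for codepoint in xrange(0x80, 0x10000):
        decomp = unicodedata.decomposition(unichr(codepoint)).split()
        if decomp and decomp[0].startswith('<'):
            decomp = decomp[1:]  # drop a compatibility tag like '<compat>'
        if decomp:
            base = int(decomp[0], 16)
            # Keep only mappings whose base character is plain ASCII.
            if base < 0x80 and codepoint not in table:
                table[codepoint] = unichr(base)
    return table

TABLE = make_table()

def asciify(text):
    # Translate what we can, then drop whatever is still non-ASCII.
    return text.translate(TABLE).encode('ascii', 'ignore')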

John Machin