I have a database full of names like:

John Smith  
Scott J. Holmes  
Dr. Kaplan  
Ray's Dog  
Levi's  
Adrian O'Brien  
Perry Sean Smyre  
Carie Burchfield-Thompson  
Björn Árnason

There are a few foreign names with accents in them that need to be converted to strings with non-accented characters.

I'd like to convert the full names (after stripping characters like "'" and "-") to user logins like:

john.smith  
scott.j.holmes  
dr.kaplan  
rays.dog  
levis
adrian.obrien  
perry.sean.smyre
carie.burchfieldthompson  
bjorn.arnason

So far I have:

Fullname = Fullname.strip()    # get rid of leading/trailing white space
Fullname = Fullname.lower()    # make everything lower case

...  # after bad chars converted/removed
Fullname = Fullname.replace(' ', '.')  # replace spaces with periods
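
Putting those steps together, what I'm after is roughly this (remove_accents is just a placeholder for the part I don't know how to do):

def make_login(fullname):
    name = remove_accents(fullname)        # <-- the missing piece
    name = name.replace("'", "").replace("-", "")
    return name.strip().lower().replace(' ', '.')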
+6  A: 

Take a look at this link

Here is the code from the page:

def latin1_to_ascii(unicrap):
    """This replaces UNICODE Latin-1 characters with
    something equivalent in 7-bit ASCII. All characters in the standard
    7-bit ASCII range are preserved. In the 8th bit range all the Latin-1
    accented letters are stripped of their accents. Most symbol characters
    are converted to something meaningful. Anything not converted is deleted.
    """
    xlate = {0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
        0xc6:'Ae', 0xc7:'C',
        0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
        0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
        0xd0:'Th', 0xd1:'N',
        0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
        0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
        0xdd:'Y', 0xde:'th', 0xdf:'ss',
        0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
        0xe6:'ae', 0xe7:'c',
        0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
        0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
        0xf0:'th', 0xf1:'n',
        0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
        0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
        0xfd:'y', 0xfe:'th', 0xff:'y',
        0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
        0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
        0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
        0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
        0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
        0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
        0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
        0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
        0xd7:'*', 0xf7:'/'
        }

    r = ''
    for i in unicrap:
        if ord(i) in xlate:
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            pass
        else:
            r += i
    return r

# This gives an example of how to use latin1_to_ascii().
# This creates a string with all the characters in the Latin-1 character set,
# then it converts the string to plain 7-bit ASCII.
if __name__ == '__main__':
    s = unicode('', 'latin-1')
    for c in range(32, 256):
        if c != 0x7f:
            s = s + unicode(chr(c), 'latin-1')
    print 'INPUT:'
    print s.encode('latin-1')
    print
    print 'OUTPUT:'
    print latin1_to_ascii(s)
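
For the question's logins, a quick illustrative sketch on top of this function (the name is written as a Latin-1 escaped literal so the snippet stays self-contained):

name = unicode('Bj\xf6rn \xc1rnason', 'latin-1')   # "Björn Árnason"
login = latin1_to_ascii(name).strip().lower().replace(' ', '.')
print login   # prints: bjorn.arnason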
Soldier.moth
Your link takes me to "Britney Spears nude"
Mark
@Mark - Ha, hooray for the permanency of weblinks!
Soldier.moth
+1  A: 

I would do something like this:

# coding=utf-8
import re

def alnum_dot(name, replace={}):
    for k, v in replace.items():
        name = name.replace(k, v)

    return re.sub("[^a-z.]", "", name.strip().lower())

print alnum_dot(u"Frédrik Holmström", {
    u"ö": "o",
    " ": "."
})

The second argument is a dict of the characters you want replaced; anything outside a-z and "." that is not replaced will be stripped.

thr
+1  A: 

The translate method allows you to delete arbitrary characters, so you can use it to drop the unwanted punctuation:

Fullname.translate(None, "'-\"")

If you want to delete whole classes of characters, you might want to use the re module.

re.sub('[^a-z0-9 ]', '', Fullname.strip().lower())
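
Combined with the question's other steps, a minimal end-to-end sketch could look like this (make_login is a hypothetical helper; it assumes plain byte strings and the Python 2 str.translate signature):

import re

def make_login(fullname):
    # drop apostrophes, hyphens and quotes, then normalise case and whitespace
    cleaned = fullname.translate(None, "'-\"").strip().lower()
    # keep only letters, digits and spaces, then join the words with periods
    return re.sub('[^a-z0-9 ]', '', cleaned).replace(' ', '.')

print make_login("Carie Burchfield-Thompson")   # carie.burchfieldthompson
print make_login("Ray's Dog")                   # rays.dog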
lambacck
+1  A: 

The following function is generic:

import unicodedata

def not_combining(char):
    return unicodedata.category(char) != 'Mn'

def strip_accents(text, encoding):
    unicode_text = unicodedata.normalize('NFD', text.decode(encoding))
    return filter(not_combining, unicode_text).encode(encoding)

# in a cp1252 environment
>>> print strip_accents("déjà", "cp1252")
deja

# in a cp1253 environment
>>> print strip_accents("καλημέρα", "cp1253")
καλημερα

Obviously, you should know the encoding of your strings.
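
For the question's logins, you could build on strip_accents like this (make_login is just an illustrative helper; it assumes UTF-8 input, written here as escaped bytes so the snippet stays self-contained):

def make_login(fullname, encoding='utf-8'):
    ascii_name = strip_accents(fullname, encoding)
    ascii_name = ascii_name.translate(None, "'-\"")   # drop punctuation
    return ascii_name.strip().lower().replace(' ', '.')

print make_login('Bj\xc3\xb6rn \xc3\x81rnason')   # prints: bjorn.arnason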

ΤΖΩΤΖΙΟΥ
+3  A: 

If you are not afraid to install third-party modules, then have a look at the Python port of the Perl module Text::Unidecode.

The module does nothing more than use a lookup table to transliterate the characters. I glanced over the code and it looks very simple, so I suppose it works on pretty much any OS and any Python version (crossing fingers). It's also easy to bundle with your application.

With this module you don't have to create your lookup table manually (= reduced risk of it being incomplete).

The advantage of this module over the Unicode normalization technique is simple: Unicode normalization does not replace every character. A good example is a character like "æ". Unicode normalization sees it as "Letter, lowercase" (Ll) with no decomposition, so the normalize method gives you neither a replacement character nor a useful hint. Unfortunately, that character is not representable in ASCII, so you'll get errors.

The mentioned module does a better job here: it actually replaces "æ" with "ae", which is useful and makes sense.

The most impressive thing I've seen is that it goes much further: it even replaces Japanese kana characters mostly properly. For example, it replaces "は" with "ha", which is perfectly fine. It's not fool-proof, though; the current version replaces "ち" with "ti" instead of "chi", so you'll have to handle the more exotic characters with care.

Usage of the module is straightforward:

>>> from unidecode import unidecode
>>> var_utf8 = "æは".decode("utf8")
>>> unidecode(var_utf8).encode("ascii")
'aeha'

Note that I have nothing to do with this module directly. It just happens that I find it very useful.

Edit: The patch I submitted fixed the bug concerning the Japanese kana. I've only fixed the ones I could spot right away; I may have missed some.

exhuma
I submitted a patch yesterday to fix the issues with the mentioned kana replacements. It's already been merged!
exhuma