I have a database full of names like:

John Smith  
Scott J. Holmes  
Dr. Kaplan  
Ray's Dog  
Levi's  
Adrian O'Brien  
Perry Sean Smyre  
Carie Burchfield-Thompson  
Björn Árnason

There are a few foreign names with accents in them that need to be converted to strings with non-accented characters.

I'd like to convert the full names (after stripping characters like "'" and "-") to user logins like:

john.smith  
scott.j.holmes  
dr.kaplan  
rays.dog  
levis
adrian.obrien  
perry.sean.smyre
carie.burchfieldthompson  
bjorn.arnason

So far I have:

Fullname = Fullname.strip()    # get rid of leading/trailing white space
Fullname = Fullname.lower()    # make everything lower case

...  # after bad chars converted/removed
Fullname = Fullname.replace(' ', '.')  # replace spaces with periods
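
Putting those steps together, what I'm after is roughly this (remove_accents is just a placeholder for the part I don't know how to do):

def make_login(fullname):
    name = remove_accents(fullname)        # <-- the missing piece
    name = name.replace("'", "").replace("-", "")
    return name.strip().lower().replace(' ', '.')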
+6  A: 

Take a look at this link

Here is the code from the page:

def latin1_to_ascii(unicrap):
    """This replaces UNICODE Latin-1 characters with
    something equivalent in 7-bit ASCII. All characters in the standard
    7-bit ASCII range are preserved. In the 8th bit range all the Latin-1
    accented letters are stripped of their accents. Most symbol characters
    are converted to something meaningful. Anything not converted is deleted.
    """
    xlate = {0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
        0xc6:'Ae', 0xc7:'C',
        0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
        0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
        0xd0:'Th', 0xd1:'N',
        0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
        0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
        0xdd:'Y', 0xde:'th', 0xdf:'ss',
        0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
        0xe6:'ae', 0xe7:'c',
        0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
        0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
        0xf0:'th', 0xf1:'n',
        0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
        0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
        0xfd:'y', 0xfe:'th', 0xff:'y',
        0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
        0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
        0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
        0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
        0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
        0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
        0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
        0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
        0xd7:'*', 0xf7:'/'
        }

    r = ''
    for i in unicrap:
        if ord(i) in xlate:
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            pass
        else:
            r += i
    return r

# This gives an example of how to use latin1_to_ascii().
# This creates a string with all the characters in the Latin-1 character set,
# then it converts the string to plain 7-bit ASCII.
if __name__ == '__main__':
    s = unicode('', 'latin-1')
    for c in range(32, 256):
        if c != 0x7f:
            s = s + unicode(chr(c), 'latin-1')
    print 'INPUT:'
    print s.encode('latin-1')
    print
    print 'OUTPUT:'
    print latin1_to_ascii(s)
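
For the question's logins, a quick illustrative sketch on top of this function (the name is written as a Latin-1 escaped literal so the snippet stays self-contained):

name = unicode('Bj\xf6rn \xc1rnason', 'latin-1')   # "Björn Árnason"
login = latin1_to_ascii(name).strip().lower().replace(' ', '.')
print login   # prints: bjorn.arnason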
Soldier.moth
Your link takes me to "Britney Spears nude"
Mark
@Mark - Ha, hooray for the permanency of weblinks!
Soldier.moth
+1  A: 

I would do something like this:

# coding=utf-8
import re

def alnum_dot(name, replace={}):
    for k, v in replace.items():
        name = name.replace(k, v)

    return re.sub("[^a-z.]", "", name.strip().lower())

print alnum_dot(u"Frédrik Holmström", {
    u"ö": "o",
    " ": "."
})

The second argument is a dict of the characters you want replaced; anything outside a-z and "." that is not replaced will be stripped.

thr
+1  A: 

The translate method allows you to delete arbitrary characters, so you can use it to drop the unwanted punctuation:

Fullname.translate(None, "'-\"")

If you want to delete whole classes of characters, you might want to use the re module.

re.sub('[^a-z0-9 ]', '', Fullname.strip().lower())
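
Combined with the question's other steps, a minimal end-to-end sketch could look like this (make_login is a hypothetical helper; it assumes plain byte strings and the Python 2 str.translate signature):

import re

def make_login(fullname):
    # drop apostrophes, hyphens and quotes, then normalise case and whitespace
    cleaned = fullname.translate(None, "'-\"").strip().lower()
    # keep only letters, digits and spaces, then join the words with periods
    return re.sub('[^a-z0-9 ]', '', cleaned).replace(' ', '.')

print make_login("Carie Burchfield-Thompson")   # carie.burchfieldthompson
print make_login("Ray's Dog")                   # rays.dog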
lambacck
+1  A: 

The following function is generic:

import unicodedata

def not_combining(char):
    return unicodedata.category(char) != 'Mn'

def strip_accents(text, encoding):
    unicode_text = unicodedata.normalize('NFD', text.decode(encoding))
    return filter(not_combining, unicode_text).encode(encoding)

# in a cp1252 environment
>>> print strip_accents("déjà", "cp1252")
deja

# in a cp1253 environment
>>> print strip_accents("καλημέρα", "cp1253")
καλημερα

Obviously, you should know the encoding of your strings.
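
For the question's logins, you could build on strip_accents like this (make_login is just an illustrative helper; it assumes UTF-8 input, written here as escaped bytes so the snippet stays self-contained):

def make_login(fullname, encoding='utf-8'):
    ascii_name = strip_accents(fullname, encoding)
    ascii_name = ascii_name.translate(None, "'-\"")   # drop punctuation
    return ascii_name.strip().lower().replace(' ', '.')

print make_login('Bj\xc3\xb6rn \xc3\x81rnason')   # prints: bjorn.arnason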

ΤΖΩΤΖΙΟΥ
+3  A: 

If you are not afraid to install third-party modules, then have a look at the Python port of the Perl module Text::Unidecode.

The module does nothing more than use a lookup table to transliterate the characters. I glanced over the code and it looks very simple, so I suppose it works on pretty much any OS and any Python version (crossing fingers). It's also easy to bundle with your application.

With this module you don't have to create your lookup table manually (= reduced risk of it being incomplete).

The advantage of this module over the Unicode normalization technique is simple: Unicode normalization does not replace every character. A good example is a character like "æ". Unicode normalization sees it as "Letter, lowercase" (Ll) with no decomposition, so the normalize method gives you neither a replacement character nor a useful hint. Unfortunately, that character is not representable in ASCII, so you'll get errors.

The mentioned module does a better job here: it actually replaces "æ" with "ae", which is useful and makes sense.

The most impressive thing I've seen is that it goes much further: it even replaces Japanese kana characters mostly properly. For example, it replaces "は" with "ha", which is perfectly fine. It's not fool-proof, though; the current version replaces "ち" with "ti" instead of "chi", so you'll have to handle the more exotic characters with care.

Usage of the module is straightforward:

>>> from unidecode import unidecode
>>> var_utf8 = "æは".decode("utf8")
>>> unidecode(var_utf8).encode("ascii")
'aeha'

Note that I have nothing to do with this module directly. It just happens that I find it very useful.

Edit: The patch I submitted fixed the bug concerning the Japanese kana. I've only fixed the ones I could spot right away; I may have missed some.

exhuma
I submitted a patch yesterday to fix the issues with the mentioned kana replacements. It's already been merged!
exhuma