I am looking to replace from a large document all high unicode characters, such as accented Es, left and right quotes, etc., with "normal" counterparts in the low range, such as a regular 'E', and straight quotes. I need to perform this on a very large document rather often. I see an example of this in what I think might be perl here: http://www.designmeme.com/mtplugins/lowdown.txt
Is there a fast way of doing this in Python without using s.replace(...).replace(...).replace(...)...? I've tried this on just a few characters to replace and the document stripping became really slow.
EDIT, my version of unutbu's code that doesn't seem to work:
# -*- coding: iso-8859-15 -*-
import unidecode
def ascii_map():
data={}
for num in range(256):
h=num
filename='x{num:02x}'.format(num=num)
try:
mod = __import__('unidecode.'+filename,
fromlist=True)
except ImportError:
pass
else:
for l,val in enumerate(mod.data):
i=h<<8
i+=l
if i >= 0x80:
data[i]=unicode(val)
return data
if __name__=='__main__':
s = u'“fancy“fancy2'
print(s.translate(ascii_map()))