I need to change some characters that are not ASCII to '_'. For example,
Tannh‰user -> Tannh_user
- If I use a regular expression in Python, how can I do this?
- Is there a better way to do this without using a regular expression?
Here is how to do it using the built-in str.decode method:
>>> 'Tannh‰user'.decode('ascii', 'replace').replace(u'\ufffd', '_')
u'Tannh___user'
(You get a unicode string, so convert it to str if you need to.)
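On Python 3, where decode lives on bytes rather than str, the same idea might look like the sketch below. The helper name underscore_non_ascii_bytes is made up for illustration, and the input is assumed to be UTF-8 encoded bytes.

```python
def underscore_non_ascii_bytes(data: bytes) -> str:
    # With errors="replace", every byte outside 0-127 decodes to U+FFFD;
    # each of those markers is then swapped for an underscore.
    return data.decode("ascii", "replace").replace("\ufffd", "_")


raw = "Tannh\u2030user".encode("utf-8")  # the per-mille sign is three bytes in UTF-8
result = underscore_non_ascii_bytes(raw)
print(result)
```

Because the replacement happens per byte, a multi-byte character produces several underscores, which matches the three underscores in the output above.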
You can also convert unicode to str, so that each non-ASCII character is replaced by a single ASCII one. The problem is that unicode.encode with 'replace' translates non-ASCII characters into '?', so you cannot tell whether a question mark was already there before; see the solution from Ignacio Vazquez-Abrams.
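To make the ambiguity concrete, here is a small Python 3 sketch (the variable names are only illustrative). Two different inputs encode to the same bytes, so the original text cannot be told apart afterwards.

```python
# One input contains a non-ASCII character, the other a literal question mark.
with_char = "Tannh\u2030user".encode("ascii", "replace")
with_qmark = "Tannh?user".encode("ascii", "replace")

# Both end up as the same byte string.
print(with_char, with_qmark)
print(with_char == with_qmark)
```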
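Another way is to use ord() and check whether each character's value falls within the ASCII range (0-127). This works for unicode strings and for str in UTF-8, Latin-1 and some other encodings (for an encoded str the check is per byte, so a multi-byte character yields several underscores):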
>>> s = u'Tannh‰user'
>>> "".join((c if ord(c) < 128 else '_' for c in s))
u'Tannh_user'
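For Python 3, where every str is already Unicode, the ord() check could be wrapped in a small helper like the one sketched here. The function name replace_non_ascii and its placeholder parameter are assumptions for illustration, not part of any library.

```python
def replace_non_ascii(text: str, placeholder: str = "_") -> str:
    # Keep code points 0-127 as they are; substitute everything else.
    return "".join(c if ord(c) < 128 else placeholder for c in text)


cleaned = replace_non_ascii("Tannh\u2030user")
print(cleaned)
```

Since this works on code points rather than bytes, each non-ASCII character becomes exactly one placeholder.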
If you know which characters you want to replace, you can apply the string method replace:
mystring.replace('oldchar', 'newchar')
import re
re.sub(r'[^\x00-\x7F]', '_', theString)
This will work if theString is unicode, or a string in an encoding where ASCII occupies values 0 to 0x7F (latin-1, UTF-8, etc.).
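A Python 3 sketch of the difference between the two cases follows; the compiled pattern names are illustrative. On a str the pattern matches whole characters, while on UTF-8 bytes it matches each non-ASCII byte separately.

```python
import re

NON_ASCII_TEXT = re.compile(r"[^\x00-\x7F]")    # for str input
NON_ASCII_BYTES = re.compile(rb"[^\x00-\x7F]")  # for bytes input

text = "Tannh\u2030user"

# One underscore per character when working on text.
as_text = NON_ASCII_TEXT.sub("_", text)

# One underscore per byte when working on the UTF-8 encoding.
as_bytes = NON_ASCII_BYTES.sub(b"_", text.encode("utf-8"))

print(as_text, as_bytes)
```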
I'd rather just call ord on every character in the string, one by one. If ord(char) >= 128, the character is not an ASCII character and should be replaced.
Using Python's support for character encodings:
# coding: utf8
import codecs

def underscorereplace_errors(exc):
    # Substitute an underscore and resume encoding at the end of the failed span.
    return (u'_', exc.end)

codecs.register_error('underscorereplace', underscorereplace_errors)
print u'Tannh‰user'.encode('ascii', 'underscorereplace')
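A Python 3 variant of the same error-handler idea is sketched below. The handler name underscore_per_char is hypothetical. Since the span from exc.start to exc.end may cover more than one unencodable character, this version emits one underscore for each character in the span rather than a single underscore for the whole span.

```python
import codecs


def underscore_per_char(exc):
    # Only handle encoding failures; re-raise anything else.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # One underscore per failed character, then resume after the span.
    return ("_" * (exc.end - exc.start), exc.end)


codecs.register_error("underscore_per_char", underscore_per_char)

encoded = "Tannh\u2030user".encode("ascii", "underscore_per_char")
run = "a\u00e4\u00f6b".encode("ascii", "underscore_per_char")
print(encoded, run)
```

The result is a bytes object; call .decode("ascii") on it if a str is needed.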