I am writing a forum in Python. I want to strip input containing the right-to-left mark and things like that. Suggestions? Possibly a regular expression?
If you simply want to restrict the characters to those of a certain character set, you could encode the string in that character set and just ignore encoding errors:
>>> uc = u'aäöüb'
>>> uc.encode('ascii', 'ignore')
'ab'
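For what it's worth, here is a minimal sketch of the same idea wrapped in a helper. The name ascii_only is made up for illustration, and the extra .decode('ascii') is my addition so that the result is text again rather than bytes on Python 3. Note that this drops every non-ASCII character, legitimate letters included, not just the right-to-left mark.

```python
def ascii_only(text):
    """Return text with every non-ASCII character dropped (hypothetical helper)."""
    # encode(..., 'ignore') silently skips characters the codec cannot represent;
    # decoding the result gives back a text string instead of bytes.
    return text.encode('ascii', 'ignore').decode('ascii')


print(ascii_only(u'a\u00e4\u00f6\u00fcb'))  # prints: ab
print(ascii_only(u'abc\u200fdef'))          # prints: abcdef
```

The second call shows that the right-to-left mark (U+200F) goes away too, simply because it is not ASCII.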
The OP, in a hard-to-read comment to another answer, has an example that appears to start like...:
comment = comment.encode('ascii', 'ignore')
comment = '\xc3\xa4\xc3\xb6\xc3\xbc'
With the two statements in this order, that would of course be a different error (a NameError: the first one tries to access comment, but only the second one binds that name), so let's assume the two lines are interchanged, as follows:
comment = '\xc3\xa4\xc3\xb6\xc3\xbc'
comment = comment.encode('ascii', 'ignore')
This would indeed cause the error the OP seems to have in that hard-to-read comment, but for a different reason: comment is a byte string (no leading u before the opening quote), while .encode applies to a unicode string. So Python first tries to make a temporary unicode object out of that byte string with the default codec, ascii, and that fails with a UnicodeDecodeError because the string is full of non-ASCII bytes.
Inserting the leading u in that literal would work:
comment = u'\xc3\xa4\xc3\xb6\xc3\xbc'
comment = comment.encode('ascii', 'ignore')
(This of course leaves comment empty, since all of its characters are ignored.) Alternatively, for example if the original byte string comes from some other source rather than a literal:
comment = '\xc3\xa4\xc3\xb6\xc3\xbc'
comment = comment.decode('latin-1')
comment = comment.encode('ascii', 'ignore')
Here, the second statement explicitly builds the unicode object with a codec that seems applicable to this example (just a guess, of course: you can't tell with certainty which codec is supposed to apply from just seeing a bare byte string! These particular bytes also happen to be valid UTF-8 for äöü, so utf-8 may be what was intended). The third statement then, again, removes all non-ASCII characters (and again leaves comment empty).
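The decode-then-encode steps could be sketched as a small function like the one below. The function name, the utf-8 default, and the choice of 'replace' for undecodable bytes are all my assumptions, not something from the OP's code; the sample bytes are made up as well. It is written so that it also runs on Python 3, where byte strings need the b prefix.

```python
def bytes_to_ascii(raw, encoding='utf-8'):
    """Decode raw bytes with a caller-chosen codec, then drop non-ASCII characters."""
    # Step 1: bytes -> text. The codec is a guess the caller has to make;
    # 'replace' turns undecodable bytes into U+FFFD instead of raising.
    text = raw.decode(encoding, 'replace')
    # Step 2: text -> ASCII, skipping whatever does not fit (U+FFFD included),
    # then back to a text string.
    return text.encode('ascii', 'ignore').decode('ascii')


raw = b'caf\xc3\xa9 \xc3\xa4\xc3\xb6\xc3\xbc!'
print(bytes_to_ascii(raw))             # prints: caf !
print(bytes_to_ascii(raw, 'latin-1'))  # prints: caf !
```

Both codecs give the same output here only because everything non-ASCII is thrown away afterwards; if you keep non-ASCII characters, picking the right codec matters.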
It's hard to guess the set of characters you want to remove from your Unicode strings. Could it be they are all the “Other, Format” characters? If yes, you can do:
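import unicodedata
# Keep every character except those in category Cf ("Other, Format"),
# which includes the right-to-left mark U+200F.
your_unicode_string = u''.join(
    c for c in your_unicode_string
    if unicodedata.category(c) != 'Cf')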
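Since the question also asks about a regular expression, here is a rough sketch of both variants side by side. The function names and the BIDI_CONTROLS pattern are mine, not from any library; the character class lists the bidirectional control characters (U+061C, U+200E, U+200F, U+202A to U+202E, U+2066 to U+2069) and is only one possible choice of what to strip.

```python
import re
import unicodedata


def strip_format_chars(text):
    """Drop every character whose general category is Cf (Other, Format)."""
    return u''.join(c for c in text if unicodedata.category(c) != 'Cf')


# Narrower alternative (an assumption about what the OP wants to remove):
# only the bidirectional control characters, matched with a character class.
BIDI_CONTROLS = re.compile(u'[\u061c\u200e\u200f\u202a-\u202e\u2066-\u2069]')


def strip_bidi_controls(text):
    """Drop only the bidirectional control characters."""
    return BIDI_CONTROLS.sub(u'', text)


sample = u'abc\u200fdef\u202eghi\u00e4'
print(strip_format_chars(sample) == u'abcdefghi\u00e4')   # prints: True
print(strip_bidi_controls(sample) == u'abcdefghi\u00e4')  # prints: True
```

The category-based version is broader: Cf also covers the soft hyphen (U+00AD) and the zero-width joiner and non-joiner (U+200D, U+200C), which some scripts and emoji sequences rely on. If that matters for your forum, the narrower regular expression may be the safer choice.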