ansaurus

Question

Removing right-to-left mark and other unicode characters from input in Python

Answer 1

A:

If you simply want to restrict the characters to those of a certain character set, you could encode the string in that character set and just ignore encoding errors:

>>> uc = u'aäöüb'
>>> uc.encode('ascii', 'ignore')
'ab'
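
On Python 3 the same idea needs one extra step, because `encode` returns a `bytes` object rather than text. A minimal sketch, assuming the input is already a `str`; the helper name `ascii_only` is made up for illustration:

```python
def ascii_only(text):
    """Drop every character that has no ASCII encoding."""
    # encode() yields bytes on Python 3, so decode back to get text again.
    return text.encode('ascii', 'ignore').decode('ascii')

print(ascii_only('a\u00e4\u00f6\u00fcb'))  # the answer's example: a, ä, ö, ü, b
print(ascii_only('abc\u200fdef'))          # U+200F is the right-to-left mark
```

Note that this discards legitimate accented letters as well as invisible marks.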

sth 2010-06-01 00:45:14

27 comment = comment.encode('ascii', 'ignore')
comment = '\xc3\xa4\xc3\xb6\xc3\xbc', comment.encode = <built-in method encode of str object at 0x11db40>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
    args = ('ascii', '\xc3\xa4\xc3\xb6\xc3\xbc', 0, 1, 'ordinal not in range(128)')
    encoding = 'ascii'
    end = 1
    object = '\xc3\xa4\xc3\xb6\xc3\xbc'
    reason = 'ordinal not in range(128)'
    start = 0

Earl Bellinger 2010-06-01 00:52:21

Your `comment` doesn't seem to be a unicode object, but a string. It seems to be UTF-8 encoded, so you first need to decode it. With `comment = comment.decode('utf-8')` you convert it to the corresponding unicode object.

sth 2010-06-01 01:12:45

For anyone curious about the end product: if uc.decode('utf-8') != uc.decode('utf-8').encode('ascii', 'ignore'): return

Earl Bellinger 2010-06-29 05:24:55

Answer 2

A:

The OP, in a hard-to-read comment to another answer, has an example that appears to start like...:

comment = comment.encode('ascii', 'ignore')
comment = '\xc3\xa4\xc3\xb6\xc3\xbc'

That of course, with the two statements in this order, would be a different error (the first one tries to access comment but only the second one binds that name), but let's assume the two lines are interchanged, as follows:

comment = '\xc3\xa4\xc3\xb6\xc3\xbc'
comment = comment.encode('ascii', 'ignore')

This, which would indeed cause the error the OP seems to have in that hard-to-read comment, is a problem for a different reason: comment is a byte string (no leading u before the opening quote), but .encode applies to a unicode string -- so Python first of all tries to make a temporary unicode out of that bytestring with the default codec, ascii, and that of course fails because the string is full of non-ascii characters.
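
The implicit step described above can be made visible by performing the ASCII decode by hand. This is only a sketch of what Python 2 does behind the scenes, written so that it also runs on Python 3 (where byte strings have no `.encode` method at all):

```python
raw = b'\xc3\xa4\xc3\xb6\xc3\xbc'  # the byte string from the OP's comment

try:
    # Python 2 attempts this decode implicitly before it can encode anything.
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

The exception raised here is the same `UnicodeDecodeError` reported in the OP's comment, failing on the first byte.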

Inserting the leading u in that literal would work:

comment = u'\xc3\xa4\xc3\xb6\xc3\xbc'
comment = comment.encode('ascii', 'ignore')

(this of course leaves comment empty since all of its characters are ignored). Alternatively -- for example if the original byte string comes from some other source, not a literal:

comment = '\xc3\xa4\xc3\xb6\xc3\xbc'
comment = comment.decode('latin-1')
comment = comment.encode('ascii', 'ignore')

here, the second statement explicitly builds the unicode with a codec that seems applicable to this example (just a guess, of course: you can't tell with certainty which codec is supposed to apply from just seeing a bare bytestring!-), then the third one, again, removes all non-ascii characters (and again leaves comment empty).
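
To see why the choice of codec matters, the sketch below decodes the same six bytes twice. It is written for Python 3 (note the `b` prefix), and the variable names are illustrative only:

```python
raw = b'\xc3\xa4\xc3\xb6\xc3\xbc'

as_latin1 = raw.decode('latin-1')  # one character per byte: six characters
as_utf8 = raw.decode('utf-8')      # two bytes per character: three characters

# Either way, nothing survives the ASCII filter.
stripped = as_utf8.encode('ascii', 'ignore').decode('ascii')

print(len(as_latin1), len(as_utf8), repr(stripped))
```

Both guesses decode without error, which is exactly why a bare byte string cannot tell you which codec was intended.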

Alex Martelli 2010-06-01 01:04:23

Sorry for the hard-to-read comment. The user passes the contents of the comment to my script, so how do I add the leading u? I am doing `comment = form.getvalue(key)` and then trying to convert it to ASCII from there.

Earl Bellinger 2010-06-01 01:24:30

@Earl, if the user is passing you a bytestring with some encoding, you need to use the last snippet I gave in my answer: explicitly decode it to unicode, then encode that unicode back to ascii while skipping non-ascii characters. But you have to know (or, worst-case, guess!-) what encoding the user is using (guessing _ought_ not to be needed since that information should hopefully be part of the `Content-Type` header in the HTTP request you're handling!-).
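
A rough sketch of that decode-then-encode step, for Python 3. Both helper names are hypothetical, and the fallback to UTF-8 when the header names no charset is an assumption, not something the HTTP request guarantees:

```python
from email.message import Message

def charset_from_content_type(header_value, default='utf-8'):
    """Pull the charset parameter out of a Content-Type header value."""
    msg = Message()
    msg['Content-Type'] = header_value
    # Falls back to `default` (an assumption) when no charset is given.
    return msg.get_content_charset(default)

def bytes_to_ascii(raw, charset):
    """Decode with the sender's charset, then drop non-ASCII characters."""
    text = raw.decode(charset)
    return text.encode('ascii', 'ignore').decode('ascii')

charset = charset_from_content_type('text/plain; charset=ISO-8859-1')
print(bytes_to_ascii(b'caf\xe9 ok', charset))
```

If the declared charset is wrong, `raw.decode` may raise `UnicodeDecodeError` or silently produce the wrong characters, so the header value should be treated as a hint rather than a guarantee.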

Alex Martelli 2010-06-01 01:51:01

Answer 3

A:

It's hard to guess the set of characters you want to remove from your Unicode strings. Could it be they are all the “Other, Format” characters? If yes, you can do:

import unicodedata

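# Keep every character except those in the "Other, Format" (Cf) category,
# which includes U+200F RIGHT-TO-LEFT MARK. Joining a generator gives a
# unicode string on both Python 2 and Python 3.
your_unicode_string = u''.join(
    c for c in your_unicode_string
    if unicodedata.category(c) != 'Cf')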
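
As a self-contained illustration of the category filter, here is a small Python 3 sketch. The function name `strip_format_chars` and the sample string are invented for this example:

```python
import unicodedata

def strip_format_chars(text):
    """Remove characters whose Unicode general category is Cf."""
    return ''.join(c for c in text if unicodedata.category(c) != 'Cf')

sample = 'abc\u200fd\u00e4f'  # contains U+200F RIGHT-TO-LEFT MARK and ä
print(repr(strip_format_chars(sample)))
```

Unlike the ASCII round trip, this keeps accented letters such as ä and removes only the invisible format characters.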

ΤΖΩΤΖΙΟΥ 2010-06-26 08:00:05
