ansaurus

Question

Answer 1

+6 A:

'\xe2' is one character; \x is an escape sequence that is followed by a two-digit hex number and specifies a byte literally.
That means you have to specify the whole expression:

>>> s = '\xe2hello'
>>> s
'\xe2hello'
>>> s.replace('\xe2', '')
'hello'

More information can be found in the Python docs.
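
Applied to the list from the question title, the same replace call can be run over every item. This is only a minimal sketch: it assumes Python 3 text strings (the session above is Python 2), and the list named items is made-up sample data, since the question's actual list is not shown.

```python
# Hypothetical sample data; the question's actual list is not shown.
items = ['\xe2hello', 'wor\xe2ld', 'plain']

# '\xe2' is a single character, so each item loses exactly that character.
cleaned = [item.replace('\xe2', '') for item in items]

print(len('\xe2'))  # 1
print(cleaned)      # ['hello', 'world', 'plain']
```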

abyx 2010-07-25 11:32:14

Hi! Thanks for the explanation!

pythonisgr8 2010-07-25 15:36:38

Answer 2

+5 A:

You can use unicode(a, 'ascii', 'ignore') to remove all non-ascii characters in the string at once.
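
Note that unicode() exists only in Python 2. On Python 3, a roughly equivalent sketch might look like the following; the helper name to_ascii is hypothetical, and it assumes the input is either bytes or str.

```python
def to_ascii(a):
    """Drop every non-ASCII byte or character from a and return a str."""
    if isinstance(a, bytes):
        # Bytes: decode as ASCII, silently skipping bytes >= 0x80.
        return a.decode('ascii', 'ignore')
    # Text: encode to ASCII, skipping unencodable characters, then decode back.
    return a.encode('ascii', 'ignore').decode('ascii')

print(to_ascii(b'\xe2hello'))  # hello
print(to_ascii('\xe2hello'))   # hello
```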

cypheon 2010-07-25 11:40:36

Hi, thanks for the reply! This is working great, though the regex would fit slightly better for my project.

pythonisgr8 2010-07-25 15:35:29

Answer 3

+2 A:

It helps here to understand the difference between a string literal and a string.

A string literal is a sequence of characters in your source code. When parsed and compiled by the Python interpreter, it produces a string, which is a sequence of characters in memory.

For example, the string literal "a" produces the string a.

String literals can take a number of forms. All of these produce the same string a:

"a"
'a'
r"a"
"""a"""
r'''a'''
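
As a quick sanity check, a small sketch confirming that these five literals compare equal at runtime:

```python
# Five different spellings in source code, one string in memory.
literals = ["a", 'a', r"a", """a""", r'''a''']

print(all(s == "a" for s in literals))  # True
print(len(set(literals)))               # 1
```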

Source code is traditionally ASCII-only, but we'd like it to contain string literals that can produce characters beyond ASCII. To do this, escapes can be used. For example, the string literal "\xe2" produces a single-character string whose character has the integer value E2 hexadecimal, or 226 decimal.

This explains the error about "\x" being an invalid escape: the parser is expecting you to specify the hexadecimal value of a character.
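
Both points can be checked from code. The sketch below assumes Python 3, where a malformed escape surfaces as a SyntaxError, and uses compile() only as a way to run the parser on a literal containing a bare \x.

```python
s = "\xe2"
print(len(s), ord(s))    # 1 226
print(len(r"\xe2"))      # 4: a raw literal keeps the backslash, x, e and 2

# A bare \x with no hex digits is rejected when the source is parsed.
bad_source = 's = "\\x"'
try:
    compile(bad_source, "<demo>", "exec")
    rejected = False
except SyntaxError:
    rejected = True
print(rejected)          # True
```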

To detect if a string has any characters in a certain range, you can use a regex with a character class specifying the lower and upper bounds of the characters you don't want:

if re.search(r"[\x90-\xff]", a):
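
The line above only detects such characters. A possible sketch of detecting and then removing them with re.sub follows. It assumes Python 3 strings, and it assumes that "unwanted" means everything outside ASCII, which is a wider range than the one in the answer; adjust the bounds as needed. The names NON_ASCII and remove_non_ascii are hypothetical.

```python
import re

# Assumption: anything outside \x00-\x7f counts as unwanted.
NON_ASCII = re.compile(r"[^\x00-\x7f]")

def remove_non_ascii(items):
    """Return a copy of items with non-ASCII characters deleted from each string."""
    return [NON_ASCII.sub("", item) for item in items]

a = "\xe2hello"
print(bool(NON_ASCII.search(a)))            # True
print(remove_non_ascii([a, "12 \xe3 34"]))  # ['hello', '12  34']
```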

Ned Batchelder 2010-07-25 13:18:04

Thanks for the explanation! This is working perfectly!

pythonisgr8 2010-07-25 15:32:18

Answer 4

+1 A:

Let's stand back and think about this a little bit ...

You're using nltk (natural language toolkit) to parse (presumably) natural language.

Your '\xe2' is highly likely to represent U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX (â).
Your '\xe3' is highly likely to represent U+00E3 LATIN SMALL LETTER A WITH TILDE (ã).

They look like natural language letters to me. Are you SURE that you don't need them?
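
One way to check what those values stand for, as a sketch: it assumes Python 3 and assumes the bytes are Latin-1 encoded, which the question does not actually state.

```python
import unicodedata

# Interpreting the byte values as Latin-1 maps each byte to the code point
# with the same number.
names = {}
for byte in (b'\xe2', b'\xe3'):
    ch = byte.decode('latin-1')
    names[ord(ch)] = unicodedata.name(ch)
    print('U+%04X' % ord(ch), unicodedata.name(ch))
```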

John Machin 2010-07-25 13:31:58

Hi, thanks for the reply! Actually, I am trying to extract numbers from the webpage, so I don't need Latin characters.

pythonisgr8 2010-07-25 15:31:00

@pythonisgr8: (1) You are using nltk to extract numbers?? (2) "Latin" doesn't mean "accented"; almost all of the characters in your comment are "Latin" (3) If you are extracting numbers only, then it doesn't matter whether the `'a'` letters in `'abracadabra'` have accents or not; you don't need to delete characters that you don't want in order to extract characters that you do want. Perhaps you should ask another question describing what you are trying to do.
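
To illustrate point (3), a minimal sketch that pulls the numbers straight out of a string without deleting anything first. It assumes Python 3, and the sample text is made up, since the asker's page content is not shown.

```python
import re

# Made-up sample text; the asker's page content is not shown.
text = "abr\xe2cad\xe3bra cost 42 euros in 2010"

# [0-9]+ matches runs of ASCII digits and simply skips everything else.
numbers = re.findall(r"[0-9]+", text)
print(numbers)  # ['42', '2010']
```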

John Machin 2010-07-25 22:10:36

Answer 5

+2 A:

The other answers do a good job of explaining your confusion about '\x'. However, although they suggest that you may not want to remove non-ASCII characters completely, they don't offer a specific way to normalize those characters instead.

If you want to obtain some "reasonably close ASCII character" (e.g., strip accents from letters but leave the underlying letter, etc.), this SO answer may help. The code in the accepted answer, using only the standard Python library, is:

import unicodedata

def strip_accents(s):
    # Decompose each character (NFD), then drop the combining marks (category Mn).
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

Of course, you'll need to apply this function to each string item in the list you mention in the title, e.g.:

cleanedlist = [strip_accents(s) for s in mylist]

if all items in mylist are strings.
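
Put together as a runnable sketch, assuming Python 3, where every str is already Unicode text. The contents of mylist are made-up sample data.

```python
import unicodedata

def strip_accents(s):
    # Same function as above, repeated so the example runs on its own.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

mylist = ['\xe2hello', 'caf\xe9', 'plain']  # made-up sample data
cleanedlist = [strip_accents(s) for s in mylist]
print(cleanedlist)  # ['ahello', 'cafe', 'plain']
```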

Alex Martelli 2010-07-25 15:32:12

Thanks for the reply! Though I do not need non-ASCII characters at present, as I am extracting numbers and their contexts, your answer might be helpful in the future!

pythonisgr8 2010-07-25 15:45:22


how to remove '\xe2' from a list
