What is the best way to strip all non-alphanumeric characters from a string, using Python?

The solutions presented in the PHP variant of this question will probably work with some minor adjustments, but don't seem very 'pythonic' to me.

For the record, I don't just want to strip periods and commas (and other punctuation), but also quotes, brackets, etc.

+10  A: 

Regular expressions to the rescue:

import re
re.sub(r'\W+', '', your_string)
Ants Aasma
What does the plus sign do in the regexp? (I know what it means, just curious as to why it's needed for the re.sub.)
Mark van Lent
@Mark: I imagine it would speed up the substitution as the replace will get rid of all non-word characters in a block in one go, rather than removing them one-by-one.
Al
Yeah, I benchmarked that while tuning some performance-critical code a while ago. If there are significant spans of characters to replace, the speedup is huge.
Ants Aasma
Thanks for the clarification.
Mark van Lent
It might not be relevant in this case, but `\W` will keep underscores as well.
Blixt
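To illustrate the two points raised in the comments above, here is a minimal sketch, assuming Python 3. The `sample` string is made up for the example. It shows that `\W` and `\W+` produce the same result, and that both leave underscores in place unless `_` is added to the character class.

```python
import re

# Hypothetical input, chosen to contain an underscore and other punctuation.
sample = "foo_bar: baz-42!"

# "_" counts as a word character, so \W does not match it.
with_plus = re.sub(r'\W+', '', sample)
without_plus = re.sub(r'\W', '', sample)
print(with_plus)      # foo_barbaz42
print(without_plus)   # foo_barbaz42 (same result, one match per character)

# Adding "_" to the character class strips underscores as well.
print(re.sub(r'[\W_]+', '', sample))  # foobarbaz42
```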
+2  A: 

How about:

def ExtractAlphanumeric(InputString):
    from string import ascii_letters, digits
    return "".join([ch for ch in InputString if ch in (ascii_letters + digits)])

This works by using a list comprehension to produce a list of the characters in InputString that are present in the combined ascii_letters and digits strings. It then joins the list together into a string.

Al
It seems that string.ascii_letters only contains letters (duh) and not numbers. I also need the numbers...
Mark van Lent
Adding string.digits would indeed solve the problem I just mentioned. :)
Mark van Lent
Yes, I realised that when I went back to read your question. Note to self: learn to read!
Al
+2  A: 
>>> import re
>>> string = "Kl13@£$%[};'\""
>>> pattern = re.compile(r'\W')
>>> string = re.sub(pattern, '', string)
>>> print string
Kl13
DisplacedAussie
+3  A: 

You could try:

print ''.join(ch for ch in some_string if ch.isalnum())
ars
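A small sketch of the `isalnum()` approach wrapped in functions, assuming Python 3, where `str.isalnum()` is Unicode-aware and therefore keeps accented letters. The function names and the sample string are made up for illustration. The second function adds an `ord(ch) < 128` check for the case where only ASCII letters and digits should survive.

```python
def keep_alnum(text):
    """Keep every character for which str.isalnum() is true."""
    return ''.join(ch for ch in text if ch.isalnum())


def keep_ascii_alnum(text):
    """Like keep_alnum, but also drop non-ASCII letters and digits."""
    return ''.join(ch for ch in text if ch.isalnum() and ord(ch) < 128)


# Hypothetical input: "naïve café #42", written with escapes to avoid
# any ambiguity about how the accented characters are encoded.
sample = "na\u00efve caf\u00e9 #42"

print(keep_alnum(sample))        # naïvecafé42
print(keep_ascii_alnum(sample))  # navecaf42
```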
+15  A: 

I just timed some functions out of curiosity:

$ python -m timeit -s \
     "import string" \
     "''.join(ch for ch in string.printable if ch.isalnum())" 
10000 loops, best of 3: 57.6 usec per loop

$ python -m timeit -s \
    "import string" \
    "filter(str.isalnum, string.printable)"                 
10000 loops, best of 3: 37.9 usec per loop

$ python -m timeit -s \
    "import re, string" \
    "re.sub('[\W_]', '', string.printable)"
10000 loops, best of 3: 27.5 usec per loop

$ python -m timeit -s \
    "import re, string" \
    "re.sub('[\W_]+', '', string.printable)"                
100000 loops, best of 3: 15 usec per loop

$ python -m timeit -s \
    "import re, string; pattern = re.compile('[\W_]+')" \
    "pattern.sub('', string.printable)" 
100000 loops, best of 3: 11.2 usec per loop
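
The timings above can also be reproduced from a script. This is a rough sketch, assuming Python 3 and the standard `timeit` module; the function names are made up for the example, and the numbers will differ by machine and Python version. Note that in Python 3 `filter()` returns an iterator rather than a string, so it needs an explicit `join`.

```python
import re
import string
import timeit

data = string.printable
pattern = re.compile(r'[\W_]+')


def with_join(text):
    return ''.join(ch for ch in text if ch.isalnum())


def with_filter(text):
    # filter() yields characters lazily in Python 3, hence the join.
    return ''.join(filter(str.isalnum, text))


def with_regex(text):
    return pattern.sub('', text)


candidates = (with_join, with_filter, with_regex)

# Check that all candidates agree on this input before comparing speed.
assert len({func(data) for func in candidates}) == 1

loops = 10000
for func in candidates:
    seconds = timeit.timeit(lambda: func(data), number=loops)
    print(f"{func.__name__}: {seconds / loops * 1e6:.2f} usec per loop")
```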
Otto Allmendinger
Very interesting results: I would have expected the regular expressions to be slower. Interestingly, I tried this with one other option (`valid_characters = string.ascii_letters + string.digits` followed by `''.join(ch for ch in string.printable if ch in valid_characters)`) and it was 6 microseconds quicker than the `isalnum()` option. Still much slower than the regexp though.
Al
+1, measuring time is good! (but in the penultimate, do `pattern.sub('', string.printable)` instead -- silly to call re.sub when you have a RE object!-).
Alex Martelli
Thanks, Alex. I didn't read the docs carefully enough!
Otto Allmendinger
On my machine, the \W, \W+ regular expressions still leave a _.
Nick Presta
True, the documentation says this too. I'll update the listing.
Otto Allmendinger
For the record: use `re.compile('[\W_]+', re.UNICODE)` to make it unicode safe.
Mark van Lent
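On the Unicode point: the `re.UNICODE` flag matters for Python 2 patterns. In Python 3, `str` patterns are Unicode-aware by default, and `re.ASCII` is the flag that restricts `\w` to `[a-zA-Z0-9_]`. A minimal sketch, assuming Python 3 and a made-up sample string:

```python
import re

# Hypothetical input: "Straße 42, 東京!"
text = "Stra\u00dfe 42, \u6771\u4eac!"

# Default for str patterns in Python 3: \w matches Unicode letters and digits.
unicode_pattern = re.compile(r'[\W_]+')

# re.ASCII limits \w to ASCII letters, digits, and the underscore.
ascii_pattern = re.compile(r'[\W_]+', re.ASCII)

print(unicode_pattern.sub('', text))  # Straße42東京
print(ascii_pattern.sub('', text))    # Strae42
```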
+3  A: 

Use the str.translate() method.

Presuming you will be doing this often:

(1) Once, create a string containing all the characters you wish to delete:

delchars = ''.join(c for c in map(chr, range(256)) if not c.isalnum())

(2) Whenever you want to scrunch a string:

scrunched = s.translate(None, delchars)
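
The two-argument `s.translate(None, delchars)` call is the Python 2 byte-string form. For Python 3 `str` objects, a rough equivalent is to build a table once with `str.maketrans` and pass it to `translate`. This is a sketch under that assumption; the `scrunch` helper name is made up for the example.

```python
# (1) Once: collect the characters to delete and build a translation table.
delchars = ''.join(c for c in map(chr, range(256)) if not c.isalnum())
delete_table = str.maketrans('', '', delchars)


# (2) Whenever a string needs scrunching:
def scrunch(text):
    return text.translate(delete_table)


print(scrunch("foo-bar_baz (42)!"))  # foobarbaz42
```

As in the original, the table only covers code points below 256, so non-alphanumeric characters outside that range pass through unchanged.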

The setup cost probably compares favourably with re.compile; the marginal cost is way lower:

C:\junk>\python26\python -mtimeit -s"import string;d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s=string.printable" "s.translate(None,d)"
100000 loops, best of 3: 2.04 usec per loop

C:\junk>\python26\python -mtimeit -s"import re,string;s=string.printable;r=re.compile(r'[\W_]+')" "r.sub('',s)"
100000 loops, best of 3: 7.34 usec per loop

Note: Using string.printable as benchmark data gives the pattern '[\W_]+' an unfair advantage: all the non-alphanumeric characters are in one bunch, whereas typical data would need more than one substitution:

C:\junk>\python26\python -c "import string; s = string.printable; print len(s),repr(s)"
100 '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

Here's what happens if you give re.sub a bit more work to do:

C:\junk>\python26\python -mtimeit -s"d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s='foo-'*25" "s.translate(None,d)"
1000000 loops, best of 3: 1.97 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;s='foo-'*25;r=re.compile(r'[\W_]+')" "r.sub('',s)"
10000 loops, best of 3: 26.4 usec per loop
John Machin
Using translate is indeed quite a bit faster. Even with a for loop added right before the substitution/translation (to make the setup costs weigh in less), the translation is still roughly 17 times faster than the regexp on my machine. Good to know.
Mark van Lent