ansaurus

Question

Stripping everything but alphanumeric chars from a string in Python

Answer 1

+10 A:

Regular expressions to the rescue:

import re
re.sub(r'\W+', '', your_string)

Ants Aasma 2009-08-14 08:57:37

What does the plus sign do in the regexp? (I know what it means, just curious as to why it's needed for the re.sub.)

Mark van Lent 2009-08-14 09:03:44

@Mark: I imagine it would speed up the substitution as the replace will get rid of all non-word characters in a block in one go, rather than removing them one-by-one.

Al 2009-08-14 09:07:10

Yeah, I benched that while tuning some performance critical code a while ago. If there are significant spans of characters to replace the speedup is huge.

Ants Aasma 2009-08-14 09:25:25

Thanks for the clarification.

Mark van Lent 2009-08-14 09:27:43
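The point about `+` in the comments above can be checked without a stopwatch. This is a minimal sketch, assuming Python 3 (the thread itself uses Python 2) and a made-up sample string. `re.subn` returns the number of substitutions along with the result, so it shows that `\W` and `\W+` give the same string while the latter performs far fewer replacements.

```python
import re

sample = "foo!!!   bar???baz"  # hypothetical input with runs of non-word characters

# Without '+': every non-word character is a separate match.
single_result, single_count = re.subn(r"\W", "", sample)

# With '+': each run of non-word characters is replaced in one go.
run_result, run_count = re.subn(r"\W+", "", sample)

print(single_result, single_count)
print(run_result, run_count)
```

Fewer matches means fewer trips through the substitution machinery, which is where the speedup mentioned above would come from.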

It might not be relevant in this case, but `\W` will keep underscores as well.

Blixt 2009-08-14 16:20:37
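To illustrate the underscore caveat: a small sketch, assuming Python 3 and an invented sample string. `_` counts as a word character, so `\W+` leaves it in place; adding it to the character class removes it too.

```python
import re

sample = "snake_case-name 42"  # hypothetical input containing an underscore

# \W matches anything that is not a word character; '_' is a word character.
keeps_underscore = re.sub(r"\W+", "", sample)

# Adding '_' to the character class strips it as well.
strips_underscore = re.sub(r"[\W_]+", "", sample)

print(keeps_underscore)
print(strips_underscore)
```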

Answer 2

+2 A:

How about:

def ExtractAlphanumeric(InputString):
    from string import ascii_letters, digits
    return "".join([ch for ch in InputString if ch in (ascii_letters + digits)])

This works by using a list comprehension to build a list of the characters in InputString that are present in the combined ascii_letters and digits strings. It then joins the list back together into a string.

Al 2009-08-14 08:58:39

It seems that string.ascii_letters only contains letters (duh) and not numbers. I also need the numbers...

Mark van Lent 2009-08-14 09:06:18

Adding string.digits would indeed solve the problem I just mentioned. :)

Mark van Lent 2009-08-14 09:08:08

Yes, I realised that when I went back to read your question. Note to self: learn to read!

Al 2009-08-14 09:21:53

Answer 3

+2 A:

>>> import re
>>> string = "Kl13@£$%[};'\""
>>> pattern = re.compile(r'\W')
>>> string = re.sub(pattern, '', string)
>>> print string
Kl13

DisplacedAussie 2009-08-14 09:01:22

Answer 4

+3 A:

You could try:

print ''.join(ch for ch in some_string if ch.isalnum())

ars 2009-08-14 09:02:28
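One thing worth knowing about the `isalnum()` approach: on Python 3 strings it is Unicode-aware, so accented letters count as alphanumeric. The sketch below assumes Python 3.7 or later (for `str.isascii()`); the function name `strip_non_alnum` and its `ascii_only` parameter are hypothetical, introduced only for illustration.

```python
def strip_non_alnum(text, ascii_only=False):
    """Keep alphanumeric characters; optionally restrict to ASCII."""
    if ascii_only:
        # str.isascii() requires Python 3.7 or later.
        return "".join(ch for ch in text if ch.isascii() and ch.isalnum())
    return "".join(ch for ch in text if ch.isalnum())


# \u00a3 is the pound sign, \u00e9 is e-acute.
sample = "Kl13@\u00a3$%\u00e9_x"
unicode_kept = strip_non_alnum(sample)
ascii_kept = strip_non_alnum(sample, ascii_only=True)

print(unicode_kept)
print(ascii_kept)
```

Which behaviour is right depends on whether "alphanumeric" in the question means ASCII letters and digits only.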

Answer 5

+15 A:

I just timed some functions out of curiosity:

$ python -m timeit -s \
     "import string" \
     "''.join(ch for ch in string.printable if ch.isalnum())" 
10000 loops, best of 3: 57.6 usec per loop

$ python -m timeit -s \
    "import string" \
    "filter(str.isalnum, string.printable)"                 
10000 loops, best of 3: 37.9 usec per loop

$ python -m timeit -s \
    "import re, string" \
    "re.sub('[\W_]', '', string.printable)"
10000 loops, best of 3: 27.5 usec per loop

$ python -m timeit -s \
    "import re, string" \
    "re.sub('[\W_]+', '', string.printable)"                
100000 loops, best of 3: 15 usec per loop

$ python -m timeit -s \
    "import re, string; pattern = re.compile('[\W_]+')" \
    "pattern.sub('', string.printable)" 
100000 loops, best of 3: 11.2 usec per loop

Otto Allmendinger 2009-08-14 10:03:32

Very interesting results: I would have expected the regular expressions to be slower. Interestingly, I tried this with one other option (`valid_characters = string.ascii_letters + string.digits` followed by `''.join(ch for ch in string.printable if ch in valid_characters)`) and it was 6 microseconds quicker than the `isalnum()` option. Still much slower than the regexp though.

Al 2009-08-14 10:19:16

+1, measuring time is good! (but in the penultimate, do `pattern.sub('', string.printable)` instead -- silly to call re.sub when you have a RE object!-).

Alex Martelli 2009-08-14 15:05:37

thanks alex, I didn't read the docs carefully enough!

Otto Allmendinger 2009-08-14 16:15:15

On my machine, the \W, \W+ regular expressions still leave a _.

Nick Presta 2009-08-14 17:02:54

True, the documentation says this too. I'll update the listing.

Otto Allmendinger 2009-08-14 17:10:08

For the record: use `re.compile('[\W_]+', re.UNICODE)` to make it unicode safe.

Mark van Lent 2009-08-24 14:01:20
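The `re.UNICODE` advice above applies to Python 2. As a hedged sketch for Python 3, where `str` patterns are already Unicode-aware by default, the choice runs the other way: `re.ASCII` opts out. The sample string is made up.

```python
import re

# Default for str patterns in Python 3: \W follows Unicode word rules.
unicode_pattern = re.compile(r"[\W_]+")

# re.ASCII: every non-ASCII character is treated as a non-word character.
ascii_pattern = re.compile(r"[\W_]+", re.ASCII)

# \u00ef is i-diaeresis, \u00e9 is e-acute.
sample = "na\u00efve_caf\u00e9 42"
unicode_result = unicode_pattern.sub("", sample)
ascii_result = ascii_pattern.sub("", sample)

print(unicode_result)
print(ascii_result)
```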

Answer 6

+3 A:

Use the str.translate() method.

Presuming you will be doing this often:

(1) Once, create a string containing all the characters you wish to delete:

delchars = ''.join(c for c in map(chr, range(256)) if not c.isalnum())

(2) Whenever you want to scrunch a string:

scrunched = s.translate(None, delchars)

The setup cost probably compares favourably with re.compile; the marginal cost is way lower:

C:\junk>\python26\python -mtimeit -s"import string;d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s=string.printable" "s.translate(None,d)"
100000 loops, best of 3: 2.04 usec per loop

C:\junk>\python26\python -mtimeit -s"import re,string;s=string.printable;r=re.compile(r'[\W_]+')" "r.sub('',s)"
100000 loops, best of 3: 7.34 usec per loop

Note: Using string.printable as benchmark data gives the pattern '[\W_]+' an unfair advantage; all the non-alphanumeric characters are in one bunch ... in typical data there would be more than one substitution to do:

C:\junk>\python26\python -c "import string; s = string.printable; print len(s),repr(s)"
100 '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

Here's what happens if you give re.sub a bit more work to do:

C:\junk>\python26\python -mtimeit -s"d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s='foo-'*25" "s.translate(None,d)"
1000000 loops, best of 3: 1.97 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;s='foo-'*25;r=re.compile(r'[\W_]+')" "r.sub('',s)"
10000 loops, best of 3: 26.4 usec per loop

John Machin 2009-08-15 00:33:48
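The two-argument `s.translate(None, delchars)` call in this answer is the Python 2 `str` form. A rough Python 3 equivalent of the same two steps, offered as a sketch rather than as the answer's own code, builds a deletion table once with `str.maketrans`. The helper name `scrunch` is hypothetical. Note that characters above code point 255 are not in the table and pass through untouched.

```python
import string

# Step 1 (once): build a table that deletes every non-alphanumeric
# character in the 0-255 range, mirroring the answer's delchars.
delchars = "".join(c for c in map(chr, range(256)) if not c.isalnum())
delete_table = str.maketrans("", "", delchars)


# Step 2 (every call): apply the prebuilt table.
def scrunch(s):
    return s.translate(delete_table)


print(scrunch(string.printable))
print(scrunch("foo-" * 3))
```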

Using translate is indeed quite a bit faster. Even when I add a for loop right before doing the substitution/translation (so that the setup costs weigh in less), the translation is still roughly 17 times faster than the regexp on my machine. Good to know.

Mark van Lent 2009-08-18 13:58:30
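For anyone who wants to repeat the comparison on their own machine, here is a small harness using the `timeit` module from within Python 3 instead of the command line. It is only a sketch: the labels and loop count are arbitrary choices, and the numbers will differ from the Python 2.6 figures quoted in the answers.

```python
import re
import timeit

pattern = re.compile(r"[\W_]+")
delchars = "".join(c for c in map(chr, range(256)) if not c.isalnum())
table = str.maketrans("", "", delchars)

data = "foo-" * 25  # same test string as the last two timings above

candidates = {
    "join/isalnum": lambda: "".join(ch for ch in data if ch.isalnum()),
    "compiled regex": lambda: pattern.sub("", data),
    "str.translate": lambda: data.translate(table),
}

# Check first that all approaches agree on the output.
results = {name: func() for name, func in candidates.items()}

loops = 10000
for name, func in candidates.items():
    seconds = timeit.timeit(func, number=loops)
    print(f"{name}: {seconds / loops * 1e6:.2f} usec per loop")
```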
