ansaurus

Question

Python: removing characters except digits from string

Answer 1

+19 A:

Use re.sub, like so:

>>> import re
>>> re.sub(r"\D", "", "aas30dsa20")
'3020'

\D matches any non-digit character, so the code above replaces every non-digit character with the empty string.
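A small sketch for Python 3, where the pattern is best written as a raw string. For str input, `\D` is Unicode-aware, so digits from other scripts count as digits unless `re.ASCII` is passed. The `mixed` sample string is an assumption made up for illustration, not something from the question.

```python
import re

# A raw string keeps Python from treating "\D" as a string escape.
ascii_only = re.sub(r"\D", "", "aas30dsa20")
print(ascii_only)  # 3020

# Assumed sample containing an Arabic-Indic digit three (U+0663).
mixed = "a\u0663b4"

# Default str matching is Unicode-aware: U+0663 counts as a digit and is kept.
unicode_digits = re.sub(r"\D", "", mixed)

# re.ASCII restricts \d and \D to [0-9], so only "4" survives.
ascii_digits = re.sub(r"\D", "", mixed, flags=re.ASCII)
print(ascii_digits)  # 4
```

If only the ten ASCII digits should survive, the character class `[^0-9]` is an equivalent way to say so without the flag.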

Or you can use filter, like so (in Python 2):

>>> filter(lambda x: x.isdigit(), "aas30dsa20")
'3020'

In Python 3, filter returns an iterator (in Python 2 it returned a string when given a string), so join the result instead:

>>> ''.join(filter(lambda x: x.isdigit(), "aas30dsa20"))
'3020'
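The lambda is not strictly needed: the unbound method `str.isdigit` can be passed to filter directly. A minimal Python 3 sketch follows; the function names are hypothetical. Note that `str.isdigit` also accepts some non-ASCII characters, such as superscript digits, so the second helper shows one stricter option.

```python
def digits_only(text):
    """Keep characters for which str.isdigit() is true (Python 3)."""
    return "".join(filter(str.isdigit, text))


def ascii_digits_only(text):
    """Hypothetical stricter variant: keep only the ten ASCII digits."""
    return "".join(ch for ch in text if ch in "0123456789")


print(digits_only("aas30dsa20"))       # 3020
print(ascii_digits_only("x\u00b22y"))  # 2
```

Here `"x\u00b22y"` contains a superscript two (U+00B2) followed by an ordinary `2`; `digits_only` would keep both, while `ascii_digits_only` keeps only the second.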

JG 2009-09-20 12:18:48

re is evil for such a simple task; the second one is the best, I think, because the 'is...' methods are the fastest for strings.

f0b0s 2009-09-20 12:25:48

your filter example is limited to py2k

SilentGhost 2009-09-20 12:29:25

@f0b0s-iu9-info: did you time it? On my machine (py3k) re is twice as fast as filter with `isdigit`; a generator with `isdigit` is halfway between them.

SilentGhost 2009-09-20 12:35:47

@SilentGhost: Thanks, I was using IDLE from py2k. It's fixed now.

JG 2009-09-20 12:35:52

Answer 2

+1 A:

Use a generator expression:

>>> s = "foo200bar"
>>> new_s = "".join(i for i in s if i in "0123456789")
>>> new_s
'200'

bayer 2009-09-20 12:21:49

Works, but man is that ugly and probably rather inefficient.

Chazadanga 2009-09-20 13:28:20

Answer 3

+1 A:

Ugly but works:

>>> s
'aaa12333bb445bb54b5b52'
>>> a = ''.join(filter(lambda x : x.isdigit(), s))
>>> a
'1233344554552'
>>>

m3rLinEz 2009-09-20 12:23:03

why do you do `list(s)`?

SilentGhost 2009-09-20 12:23:45

@SilentGhost it's my misunderstanding. had it corrected thanks :)

m3rLinEz 2009-09-20 12:26:37

Answer 4

+3 A:

along the lines of bayer's answer:

''.join(i for i in s if i.isdigit())

SilentGhost 2009-09-20 12:23:17

Answer 5

+4 A:

You can use filter:

filter(lambda x: x.isdigit(), "dasdasd2313dsa")

On python3.0 you have to join this (kinda ugly :( )

''.join(filter(lambda x: x.isdigit(), "dasdasd2313dsa"))

freiksenet 2009-09-20 12:24:05

only in py2k; in py3k it returns an iterator

SilentGhost 2009-09-20 12:33:28

Answer 6

+10 A:

s=''.join(i for i in s if i.isdigit())

Another generator variant.

f0b0s 2009-09-20 12:24:18

My favorite. +1

lost-theory 2009-09-20 12:27:38

Answer 7

+10 A:

In Python 2.*, by far the fastest approach is the .translate method:

>>> x='aaa12333bb445bb54b5b52'
>>> import string
>>> all=string.maketrans('','')
>>> nodigs=all.translate(all, string.digits)
>>> x.translate(all, nodigs)
'1233344554552'
>>>

string.maketrans makes a translation table (a string of length 256) which in this case is the same as ''.join(chr(x) for x in range(256)) (just faster to make;-). .translate applies the translation table (which here is irrelevant since all essentially means identity) AND deletes characters present in the second argument -- the key part.

.translate works very differently on Unicode strings (and strings in Python 3 -- I do wish questions specified which major-release of Python is of interest!) -- not quite this simple, not quite this fast, though still quite usable.
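For byte strings in Python 3, the two-argument deletion form still exists as `bytes.translate(table, delete)`. The following is a rough analogue of the 2.* session above, assuming the input is available as ASCII bytes rather than str; the variable names are illustrative.

```python
import string

# All 256 byte values, the Python 3 counterpart of string.maketrans('', '').
all_bytes = bytes(range(256))

# Every byte that is NOT an ASCII digit; these are the bytes to delete.
digits = string.digits.encode("ascii")
non_digits = all_bytes.translate(None, digits)

x = b"aaa12333bb445bb54b5b52"
# table=None means "no remapping"; the second argument lists bytes to drop.
result = x.translate(None, non_digits)
print(result)  # b'1233344554552'
```

This only helps when working with bytes; for str objects the mapping-based form of translate is what applies.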

Back to 2.*, the performance difference is impressive...:

$ python -mtimeit -s'import string; all=string.maketrans("", ""); nodig=all.translate(all, string.digits); x="aaa12333bb445bb54b5b52"' 'x.translate(all, nodig)'
1000000 loops, best of 3: 1.04 usec per loop
$ python -mtimeit -s'import re;  x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'
100000 loops, best of 3: 7.9 usec per loop

Speeding things up by 7-8 times is hardly peanuts, so the translate method is well worth knowing and using. The other popular non-RE approach...:

$ python -mtimeit -s'x="aaa12333bb445bb54b5b52"' '"".join(i for i in x if i.isdigit())'
100000 loops, best of 3: 11.5 usec per loop

is 50% slower than RE, so the .translate approach beats it by over an order of magnitude.

In Python 3, or for Unicode, you need to pass .translate a mapping (with ordinals, not characters directly, as keys) that returns None for what you want to delete. Here's a convenient way to express this for deletion of "everything but" a few characters:

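import string

class Del:
  # Mapping for str.translate: kept characters map to themselves,
  # everything else maps to None, which translate treats as "delete".
  def __init__(self, keep=string.digits):
    self.comp = dict((ord(c),c) for c in keep)
  def __getitem__(self, k):
    return self.comp.get(k)

DD = Del()

x='aaa12333bb445bb54b5b52'
x.translate(DD)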

also emits '1233344554552'. However, putting this in xx.py we have...:

$ python3.1 -mtimeit -s'import re;  x="aaa12333bb445bb54b5b52"' 're.sub(r"\D", "", x)'
100000 loops, best of 3: 8.43 usec per loop
$ python3.1 -mtimeit -s'import xx; x="aaa12333bb445bb54b5b52"' 'x.translate(xx.DD)'
10000 loops, best of 3: 24.3 usec per loop

...which shows that the performance advantage disappears for this kind of "deletion" task and becomes a performance decrease: here .translate is roughly three times slower than re.sub.
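Timings like these depend heavily on the interpreter version and the input, so it may be worth re-measuring locally. Below is a small harness sketch for Python 3 under a few assumptions of mine: the `KeepOnly` class is a hypothetical variant of the mapping idea built on `dict.__missing__`, the regular expression is precompiled (unlike the `re.sub` call timed above), and the call count of 20,000 is arbitrary.

```python
import re
import string
import timeit


class KeepOnly(dict):
    """Translation table that deletes every character not listed in `keep`."""

    def __init__(self, keep=string.digits):
        super().__init__((ord(c), c) for c in keep)

    def __missing__(self, key):
        # str.translate deletes characters whose mapping is None.
        return None


TABLE = KeepOnly()
NON_DIGIT = re.compile(r"\D")
x = "aaa12333bb445bb54b5b52"

# Three ways to strip non-digits, all expected to give the same string.
candidates = {
    "re.sub": lambda: NON_DIGIT.sub("", x),
    "join/isdigit": lambda: "".join(c for c in x if c.isdigit()),
    "translate": lambda: x.translate(TABLE),
}

for name, func in candidates.items():
    seconds = timeit.timeit(func, number=20_000)
    print(f"{name:>12}: {seconds:.3f} s per 20,000 calls -> {func()!r}")
```

No timing figures are given here on purpose; the ranking on a current interpreter may differ from the 3.1 numbers above.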

Alex Martelli 2009-09-20 16:37:19

Comprehensive, especially the Python 3.x (Unicode) part. Maybe Unicode is more powerful in a much bigger domain, for example: removing all characters except Chinese characters from a Unicode string.

sunqiang 2009-09-21 01:54:37

@sunqiang, yes, absolutely -- there's a reason Py3k has gone to Unicode as THE text string type, instead of byte strings as in Py2 -- same reason Java and C# have always had the same "string means unicode" meme... some overhead, maybe, but MUCH better support for just about anything but English!-).

Alex Martelli 2009-09-21 02:07:27
