I'd like to split strings like these

'foofo21' 'bar432' 'foobar12345'

into

['foofo', '21'] ['bar', '432'] ['foobar', '12345']

Does anybody know a simple way to do this in Python?

+6  A: 

I would approach this by using re.match in the following way:

import re

match = re.match(r"([a-z]+)([0-9]+)", 'foofo21', re.I)
if match:
    items = match.groups()
    # items is ('foofo', '21')
Evan Fosmark
you probably want \w instead of [a-z] and \d instead of [0-9]
Dan
@Dan: Using \w is a poor choice as it matches all alphanumeric characters (and the underscore), not just a-z. So, with a greedy \w+, the first group would swallow everything except the last digit, e.g. ('foofo2', '1').
Evan Fosmark
Not if you match it ungreedy as I do in my answer.
PEZ
What about upper case?
Bernard
@Bernard, notice the `re.I` at the end. That makes case a non-issue.
Evan Fosmark
You might get some false positives using this method. If you tried m = r.match("abc123def"), then m.groups() would get you ('abc', '123'). That's because re.match() matches from the beginning of a string but doesn't need to match the entire string.
eksortso
If that's a concern, you can tack '\b' (IIRC) at the end, to specify that the match must end at a word boundary (or '$' to match the end of the string).
Jeff Shannon
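To make the anchoring suggestion from the comments concrete, here is a minimal sketch that uses re.fullmatch (Python 3.4+) in place of a trailing '$'. The helper name split_text_number is made up for illustration and is not from the answer; the pattern is the same one used above.

```python
import re

# Same pattern as in the answer; re.I makes it case-insensitive.
PATTERN = re.compile(r"([a-z]+)([0-9]+)", re.I)

def split_text_number(s):
    """Hypothetical helper: return (text, number) or None.

    fullmatch requires the whole string to match, so trailing
    characters after the digits cause a rejection instead of
    being silently ignored.
    """
    m = PATTERN.fullmatch(s)
    return m.groups() if m else None

print(split_text_number("foofo21"))
print(split_text_number("abc123def"))
```

With this, 'abc123def' is rejected rather than reported as ('abc', '123'). Whether rejecting such input is what you want depends on your data.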
+5  A: 
>>> r = re.compile("([a-zA-Z]+)([0-9]+)")
>>> m = r.match("foobar12345")
>>> m.group(1)
'foobar'
>>> m.group(2)
'12345'

So, if you have a list of strings with that format:

import re
r = re.compile("([a-zA-Z]+)([0-9]+)")
strings = ['foofo21', 'bar432', 'foobar12345']
print([r.match(string).groups() for string in strings])

Output:

[('foofo', '21'), ('bar', '432'), ('foobar', '12345')]
Federico Ramponi
+1  A: 

I'm always the one to bring up findall() =)

>>> strings = ['foofo21', 'bar432', 'foobar12345']
>>> [re.findall(r'(\w+?)(\d+)', s)[0] for s in strings]
[('foofo', '21'), ('bar', '432'), ('foobar', '12345')]

Note that I'm using a simpler (less to type) regex than most of the previous answers.

PEZ
r'\w' matches '_'. I don't see '_' in the question.
J.F. Sebastian
I don't see A-Z in the question. It says "text and numbers".
PEZ
@PEZ: If you allow any text except numbers then your regexp should be r'(\D+)(\d+)'.
J.F. Sebastian
\w makes the most sense
PEZ
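Since the comments disagree about which character class fits "text", a small side-by-side sketch may help. The dictionary keys and the helper try_all are made-up names for illustration, and re.match is used here (rather than findall) so that every pattern is tried from the start of the string.

```python
import re

# The three candidate patterns discussed in this thread.
patterns = {
    "letters": r"([a-zA-Z]+)(\d+)",   # ASCII letters only
    "word": r"(\w+?)(\d+)",           # non-greedy \w, also matches '_'
    "non_digit": r"(\D+)(\d+)",       # anything that is not a digit
}

def try_all(s):
    """Hypothetical helper: apply each pattern to s from the start."""
    result = {}
    for name, pat in patterns.items():
        m = re.match(pat, s)
        result[name] = m.groups() if m else None
    return result

print(try_all("foofo21"))
print(try_all("foo_bar42"))
print(try_all("foo bar42"))
```

All three agree on 'foofo21'. For 'foo_bar42' only the letters-only pattern fails to match, and for 'foo bar42' only the \D version matches from the start. Which behaviour is right depends on what "text" means in your input.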
+3  A: 

Yet Another Option:

>>> [re.split(r'(\d+)', s) for s in ('foofo21', 'bar432', 'foobar12345')]
[['foofo', '21', ''], ['bar', '432', ''], ['foobar', '12345', '']]
J.F. Sebastian
Neat. Or even: [re.split(r'(\d+)', s)[0:2] for s in ...] to get rid of that extra empty string. Note, though, that compared with \w, this treats the text part as anything that is not a digit, i.e. [^\d].
PEZ
@PEZ: There may be more than one pair, and an empty string may be at the beginning of the list. You could remove empty strings with `[filter(None, re.split(r'(\d+)', s)) for s in ('foofo21','a1')]`
J.F. Sebastian
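For Python 3, where filter() returns an iterator rather than a list, the same idea can be sketched with a list comprehension. The function name split_runs is made up for illustration.

```python
import re

def split_runs(s):
    """Hypothetical helper: split s into alternating non-digit and digit runs.

    re.split keeps the captured digit runs in the result; the
    comprehension drops the empty strings that appear when s
    starts or ends with digits.
    """
    return [part for part in re.split(r"(\d+)", s) if part]

print(split_runs("foofo21"))
print(split_runs("a1b22"))
print(split_runs("42abc"))
```

This handles more than one text/number pair and leading digits, so the result is not always a two-element list.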
+1  A: 
>>> def mysplit(s):
...     head = s.rstrip('0123456789')
...     tail = s[len(head):]
...     return head, tail
... 
>>> [mysplit(s) for s in ['foofo21', 'bar432', 'foobar12345']]
[('foofo', '21'), ('bar', '432'), ('foobar', '12345')]
Mike