views:

532

answers:

6

What is the pythonic way to split a string before the occurrences of a given set of characters?

For example, I want to split 'TheLongAndWindingRoad' at any occurrence of an uppercase letter (possibly except the first), and obtain ['The', 'Long', 'And', 'Winding', 'Road'].

Edit: It should also split single occurrences, i.e. from 'ABC' I'd like to obtain ['A', 'B', 'C'].

+16  A: 

Unfortunately, re.split could not split on a zero-width match in Python versions before 3.7. But you can use re.findall instead:

>>> import re
>>> re.findall('[A-Z][^A-Z]*', 'TheLongAndWindingRoad')
['The', 'Long', 'And', 'Winding', 'Road']
>>> re.findall('[A-Z][^A-Z]*', 'ABC')
['A', 'B', 'C']
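
Note that on Python 3.7 and later, re.split does accept patterns that can match an empty string, so a zero-width lookahead can be used directly. The following is a minimal sketch assuming Python 3.7+; the helper name split_before_upper is only for illustration and is not part of the re module:

```python
import re

def split_before_upper(s):
    # (?=[A-Z]) is a zero-width lookahead: it matches the empty position
    # just before each uppercase ASCII letter without consuming it.
    # Requires Python 3.7+, where re.split can split on empty matches.
    pieces = re.split(r'(?=[A-Z])', s)
    # A match at position 0 produces a leading '', so drop empty pieces.
    return [p for p in pieces if p]

print(split_before_upper('TheLongAndWindingRoad'))
print(split_before_upper('ABC'))
```

Unlike the findall pattern, this keeps any text that comes before the first uppercase letter, e.g. 'theLong' becomes ['the', 'Long'].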
Mark Byers
+3  A: 
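import re
# The capturing group makes re.split return the matched pieces too; filter(None, ...) drops the empty strings in between.
list(filter(None, re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad")))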
Gabe
+8  A: 
>>> import re
>>> re.findall('[A-Z][a-z]*', 'TheLongAndWindingRoad')
['The', 'Long', 'And', 'Winding', 'Road']

>>> re.findall('[A-Z][a-z]*', 'SplitAString')
['Split', 'A', 'String']

>>> re.findall('[A-Z][a-z]*', 'ABC')
['A', 'B', 'C']

If you want "It'sATest" to split to ["It's", 'A', 'Test'], change the regex to "[A-Z][a-z']*".

gnibbler
+1 for being the first to get ABC working. I've also updated my answer now.
Mark Byers
>>> re.findall('[A-Z][a-z]*', "It's about 70% of the Economy") ----->['It', 'Economy']
ChristopheD
@ChristopheD: The OP doesn't say how non-alpha characters should be treated.
gnibbler
@gnibbler: true, but this regex approach also drops all regular (plain alpha) words that do not start with an uppercase letter. I doubt that was the intention of the OP.
ChristopheD
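
One possible way to address the point raised in the comments above is to add a second alternative that picks up a leading run containing no uppercase letter. This is only a sketch: it assumes such text should be kept as its own piece, which the OP does not specify, and the names PATTERN and split_keep_all are made up for the example.

```python
import re

# First alternative: an uppercase letter plus everything up to the next one.
# Second alternative: a run with no uppercase letter at all, which can only
# occur at the start of the string (before the first uppercase letter).
PATTERN = re.compile(r"[A-Z][^A-Z]*|[^A-Z]+")

def split_keep_all(s):
    # findall returns the non-overlapping matches left to right;
    # together they cover every character of s.
    return PATTERN.findall(s)

print(split_keep_all("theLongAndWindingRoad"))
print(split_keep_all("It's about 70% of the Economy"))
```

With this pattern nothing is dropped, so joining the pieces gives back the original string.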
+1  A: 

Alternative solution (if you dislike explicit regexes):

s = 'TheLongAndWindingRoad'

pos = [i for i,e in enumerate(s) if e.isupper()]

parts = []
for j in range(len(pos)):
    try:
        parts.append(s[pos[j]:pos[j+1]])
    except IndexError:
        # last uppercase position: take the rest of the string
        parts.append(s[pos[j]:])

print(parts)
ChristopheD
+1  A: 

A variation on @ChristopheD's solution:

s = 'TheLongAndWindingRoad'

# appending 'A' adds a sentinel position at len(s), which closes the last slice
pos = [i for i,e in enumerate(s+'A') if e.isupper()]
parts = [s[pos[j]:pos[j+1]] for j in range(len(pos)-1)]

print(parts)
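
Since the question asks about a given set of characters in general, the same index-based idea can be sketched as a small function. This is an illustrative sketch, not anyone's posted solution: the name split_before and its chars parameter are hypothetical, and it assumes text before the first split character should be kept.

```python
import string

def split_before(s, chars):
    # Hypothetical helper: split s before every character found in chars.
    if not s:
        return []
    # Indices where a new piece starts. 0 is always a start, so any text
    # before the first split character is kept as its own piece.
    starts = [0] + [i for i, c in enumerate(s) if c in chars and i > 0]
    # Pair each start with the next start; len(s) closes the final piece.
    ends = starts[1:] + [len(s)]
    return [s[a:b] for a, b in zip(starts, ends)]

print(split_before('TheLongAndWindingRoad', string.ascii_uppercase))
print(split_before('ABC', string.ascii_uppercase))
```

Pairing starts with ends via zip avoids both the try/except and the sentinel character.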
pwdyson
+4  A: 

Here is an alternative regex solution. The problem can be rephrased as "how do I insert a space before each uppercase letter, before doing the split":

>>> import re
>>> s = "TheLongAndWindingRoad ABC A123B45"
>>> re.sub(r"([A-Z])", r" \1", s).split()
['The', 'Long', 'And', 'Winding', 'Road', 'A', 'B', 'C', 'A123', 'B45']

This has the advantage of preserving all non-whitespace characters, which most other solutions do not. Note that any whitespace already in the string is also split on and discarded.

Dave Kirby