Hi,

Is there any equivalent to str.split in Python that also returns the delimiters?

I need to preserve the whitespace layout for my output after processing some of the tokens.

Example:

>>> s="\tthis is an  example"
>>> print s.split()
['this', 'is', 'an', 'example']

>>> print what_I_want(s)
['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

Thanks!

+7  A: 

How about

import re
splitter = re.compile(r'(\s+|\S+)')
splitter.findall(s)
Jonathan Feinberg
elegant and easily expandable (think `(\s+|\w+|\S+)`).
hop
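To make the comment above concrete, here is a minimal sketch of the expanded pattern. The names `TOKEN_RE` and `tokenize` are made up for illustration and are not from the answer; only the regex itself comes from the thread.

```python
import re

# Hypothetical name; the pattern is the one suggested in the comment.
TOKEN_RE = re.compile(r'(\s+|\w+|\S+)')

def tokenize(s):
    """Return runs of whitespace, word characters, and anything else, in order."""
    return TOKEN_RE.findall(s)

tokens = tokenize("\tthis is an  example!")
# Every character lands in exactly one token, so joining restores the input.
restored = "".join(tokens)
```

Because the three alternatives together match any character, nothing is dropped and the original layout can be rebuilt with "".join(tokens). Note that \S+ is greedy, so punctuation followed directly by letters (as in "a,b") yields the single token ",b".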
+3  A: 
>>> re.compile(r'(\s+)').split("\tthis is an  example")
['', '\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']
Denis Otkidach
+2  A: 

The re module provides this functionality:

>>> import re
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']

(quoted from the Python documentation).

For your example (split on whitespace), use re.split(r'(\s+)', '\tThis is an example').

The key is to enclose the regex on which to split in capturing parentheses. That way, the delimiters are added to the list of results.

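Edit: As pointed out, any leading/trailing delimiters will of course also be added to the list, together with an empty string before a leading delimiter or after a trailing one. To avoid that, you can call .strip() on your input string first, although that also discards the leading/trailing whitespace itself.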
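If the leading whitespace has to survive (as in the question's expected output), one possible alternative to .strip() is to filter out the empty strings after splitting. This is only a sketch; the helper name `split_keep_whitespace` is hypothetical.

```python
import re

def split_keep_whitespace(s):
    """Split on whitespace runs, keep the runs, and drop empty strings."""
    # The capturing group makes re.split return the delimiters too;
    # the filter removes the '' produced by a leading or trailing delimiter.
    return [tok for tok in re.split(r'(\s+)', s) if tok]

result = split_keep_whitespace("\tthis is an  example")
```

For the question's input this should give ['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example'].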

Tim Pietzcker
not using the OP's string masks the fact that the empty string is included as the first element of the returned list.
hop
Thanks. I edited my post accordingly (although in this case, the OP's spec ("want to preserve whitespace") and his example were contradictory).
Tim Pietzcker
No, it wasn't... there was one example of the current behaviour, and another of the desired one.
fortran
Oh, sorry. My bad.
Tim Pietzcker
A: 

Thanks, guys, for pointing out the re module. I'm still trying to decide between that and using my own function that returns a sequence...

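def split_keep_delimiters(s, delims="\t\n\r "):
    """Yield alternating runs of delimiter and non-delimiter characters."""
    if not s:
        return  # nothing to yield for an empty string
    delim_group = s[0] in delims  # True while inside a run of delimiters
    start = 0
    for index, char in enumerate(s):
        if delim_group != (char in delims):
            # The kind of run changed: emit the finished run, start a new one.
            delim_group = not delim_group
            yield s[start:index]
            start = index
    yield s[start:]  # emit the final run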

If I had time I'd benchmark them xD
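For anyone who does want to benchmark them, a rough sketch with timeit might look like the following. All names here (`regex_tokens`, `generator_tokens`, `compare`) are hypothetical, the hand-written splitter is restated as a list-returning function so both sides do the same amount of work, and the timings will depend on the input and the interpreter.

```python
import re
import timeit

_WS_SPLITTER = re.compile(r'(\s+)')

def regex_tokens(s):
    """Regex version: split on whitespace runs, keep them, drop empty strings."""
    return [tok for tok in _WS_SPLITTER.split(s) if tok]

def generator_tokens(s, delims="\t\n\r "):
    """Hand-written version: walk the string and cut where the run type changes."""
    if not s:
        return []
    tokens = []
    in_delims = s[0] in delims
    start = 0
    for index, char in enumerate(s):
        if in_delims != (char in delims):
            in_delims = not in_delims
            tokens.append(s[start:index])
            start = index
    tokens.append(s[start:])
    return tokens

def compare(sample, repeat=1000):
    """Return (regex_seconds, generator_seconds) for `repeat` calls each."""
    regex_time = timeit.timeit(lambda: regex_tokens(sample), number=repeat)
    gen_time = timeit.timeit(lambda: generator_tokens(sample), number=repeat)
    return regex_time, gen_time

sample = "\tthis is an  example" * 50
regex_time, gen_time = compare(sample, repeat=100)
```

It is worth checking that both functions return the same tokens for the sample before trusting the numbers. Keep in mind that \s matches more characters than the default delims string, so the two can differ on inputs containing, say, form feeds.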

fortran
No need for regex or reinventing the wheel if you have Python 2.5 onwards. See my answer.
+3  A: 

Have you looked at pyparsing? Example borrowed from the pyparsing wiki:

>>> from pyparsing import Word, alphas
>>> greet = Word(alphas) + "," + Word(alphas) + "!"
>>> hello1 = 'Hello, World!'
>>> hello2 = 'Greetings, Earthlings!'
>>> for hello in hello1, hello2:
...     print (u'%s \u2192 %r' % (hello, greet.parseString(hello))).encode('utf-8')
... 
Hello, World! → (['Hello', ',', 'World', '!'], {})
Greetings, Earthlings! → (['Greetings', ',', 'Earthlings', '!'], {})
jcdyer