string.split() returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?
No, but it should be easy enough to write one using itertools.takewhile().
EDIT:
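Very simple, half-broken implementation (it stops at the first run of consecutive whitespace):

import itertools
import string

def isplitwords(s):
    i = iter(s)
    while True:
        # takewhile() also consumes the whitespace character that ends the word
        r = list(itertools.takewhile(lambda x: x not in string.whitespace, i))
        if not r:
            # Empty chunk: end of input, or two whitespace characters in a row.
            # A plain return ends the generator; raising StopIteration inside
            # a generator is a RuntimeError since Python 3.7 (PEP 479).
            return
        yield ''.join(r)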
I don't see any obvious benefit to a generator version of split(). The generator object is going to have to contain the whole string to iterate over, so you're not going to save any memory by having a generator.
If you wanted to write one it would be fairly easy though:
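import string

def gsplit(s, sep=string.whitespace):
    # sep is treated as a set of single-character separators
    word = []
    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)
    # Emit the last word if the string does not end with a separator
    if word:
        yield "".join(word)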
You can build one easily using str.split itself with a limit:
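def isplit(s, sep=None):
    while s:
        # Split off one part; the remainder (if any) is parts[1]
        parts = s.split(sep, 1)
        if not parts:
            # Only whitespace was left (possible when sep is None)
            return
        yield parts[0]
        s = parts[1] if len(parts) == 2 else ''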
This way, you don't have to replicate split()'s functionality and behaviour (e.g. when sep=None), and it relies on split()'s possibly fast native implementation. I assume that str.split will stop scanning the string for separators once it has enough 'parts'.
As Glenn Maynard points out, this scales poorly for large strings (O(n^2)), because every call copies the rest of the string. I've confirmed this through 'timeit' tests.
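A rough way to reproduce that kind of measurement is sketched below. It is not the original test: the helper name time_isplit and the input sizes are made up for illustration, and the only assumption is the standard timeit module. If the cost really is quadratic, doubling the number of words should roughly quadruple the time once the input is large enough for copying to dominate.

```python
import timeit


def isplit(s, sep=None):
    # The split-with-limit approach from above, repeated so this runs alone.
    while s:
        parts = s.split(sep, 1)
        if not parts:
            return
        yield parts[0]
        s = parts[1] if len(parts) == 2 else ''


def time_isplit(n_words):
    # Hypothetical helper: seconds needed to exhaust isplit() on n_words words.
    text = "word " * n_words
    return timeit.timeit(lambda: sum(1 for _ in isplit(text)), number=1)


if __name__ == "__main__":
    # Arbitrary sizes; each one doubles the previous input length.
    for n in (5000, 10000, 20000):
        print(n, time_isplit(n))
```

Exact timings depend on the machine and the Python version, so no figures are given here.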
This is a generator version of split() implemented via re.search() that does not have the problem of allocating too many substrings.
import re

def itersplit(s, sep=None):
    # With no separator, split on runs of whitespace like str.split()
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            # No more separators: emit the tail (empty tails only for explicit sep)
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()

sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')
EDIT: Corrected handling of surrounding whitespace if no separator chars are given.
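One place where a lazy split can pay off is when only the first few parts are needed. The sketch below repeats itersplit() so that it runs on its own and takes three words with itertools.islice; the input string is a made-up example. Because the generator searches only as far as it has to, the rest of the text is never scanned.

```python
import itertools
import re


def itersplit(s, sep=None):
    # Same generator as above, repeated so this snippet is self-contained.
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


# Hypothetical large input: one million short words.
text = "word " * 1_000_000

# Only the first three matches are ever searched for.
first_three = list(itertools.islice(itersplit(text), 3))
print(first_three)  # ['word', 'word', 'word']
```

By contrast, text.split()[:3] would build the full million-element list before slicing it.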