string.split() returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?

+1  A: 

No, but it should be easy enough to write one using itertools.takewhile().

EDIT:

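Very simple implementation (splits on whitespace only):

import itertools
import string

def isplitwords(s):
  i = iter(s)
  while True:
    # Skip leading whitespace; stop once the input is exhausted.
    first = next((c for c in i if c not in string.whitespace), None)
    if first is None:
      return
    # takewhile() also consumes the whitespace character that ends the word.
    rest = itertools.takewhile(lambda x: x not in string.whitespace, i)
    yield first + ''.join(rest)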
Ignacio Vazquez-Abrams
@Ignacio: The example in the docs uses a list of integers to illustrate the use of `takewhile`. What would be a good `predicate` for splitting a string into words (default `split`) using `takewhile()`?
Manoj Govindan
Look for presence in `string.whitespace`.
Ignacio Vazquez-Abrams
The separator can have multiple characters, `'abc<def<>ghi<><>lmn'.split('<>') == ['abc<def', 'ghi', '', 'lmn']`
KennyTM
@Ignacio: Can you add an example to your answer?
Manoj Govindan
Easy to write, but *many* orders of magnitude slower. This is an operation that really should be implemented in native code.
Glenn Maynard
@KennyTM: Sure, it *can* be. But it doesn't always need to be, and it usually is not.
Ignacio Vazquez-Abrams
@Glenn: Is string type's `split` implemented in native code? I checked `string.split` and found it dispatches to `s.split` where `s` is the first argument to `string.split`.
Manoj Govindan
@Manoj: `str` and `unicode` are implemented in native code, so yes.
Ignacio Vazquez-Abrams
@Ignacio: Got it. Is a native generator version possible at all?
Manoj Govindan
Probably. You may need to implement a new type for the generator and fill its `tp_iternext` member, but I don't know all the details.
Ignacio Vazquez-Abrams
It's a lot more work, and I doubt the value of this to begin with, but anything you can do in Python you can do natively if you really want to.
Glenn Maynard
+2  A: 

I don't see any obvious benefit to a generator version of split(). The generator object is going to have to contain the whole string to iterate over so you're not going to save any memory by having a generator.

If you wanted to write one it would be fairly easy though:

import string

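def gsplit(s, sep=string.whitespace):
    # sep is treated as a set of single-character separators.
    word = []

    for c in s:
        if c in sep:
            # A separator ends the current word; runs of separators are skipped.
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)

    # Emit the last word if the string does not end with a separator.
    if word:
        yield "".join(word)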
Dave Webb
You'd halve the memory used, by not having to store a second copy of the string in each resulting part, plus the array and object overhead (which is typically more than the strings themselves). That generally doesn't matter, though (if you're splitting strings so large that this matters, you're probably doing something wrong), and even a native C generator implementation would always be significantly slower than doing it all at once.
Glenn Maynard
@Glenn Maynard - I just realised that. For some reason I originally thought the generator would store a copy of the string rather than a reference. A quick check with `id()` put me right. And obviously, as strings are immutable, you don't need to worry about someone changing the original string while you're iterating over it.
Dave Webb
Isn't the main point in using a generator not the memory usage, but that you could save yourself having to split the whole string if you wanted to exit early? (That's not a comment on your particular solution, I was just surprised by the discussion about memory).
Scott Griffiths
@Scott: It's hard to think of a case where that's really a win--where 1: you want to stop splitting partway through, 2: you don't know how many words you're splitting in advance, 3: you have a large enough string for it to matter, and 4: you consistently stop early enough for it to be a significant win over str.split. That's a very narrow set of conditions.
Glenn Maynard
A: 

You can build one easily using str.split itself with a limit:

def isplit(s, sep=None):
    while True:
        parts = s.split(sep, 1)
        if not parts:
            # Only happens with sep=None when nothing but whitespace is left.
            return
        yield parts[0]
        if len(parts) < 2:
            return
        s = parts[1]

This way, you don't have to replicate split()'s functionality and behaviour (e.g. when sep=None), and it relies on the native implementation, which is likely fast. I assume that str.split stops scanning the string for separators once it has enough parts.

As Glenn Maynard points out, this scales poorly for large strings (O(n^2)). I've confirmed this through `timeit` tests.
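
The quadratic cost comes from each `s.split(sep, 1)` call copying the remainder of the string into a new object. A minimal sketch of a linear-time alternative, assuming an explicit, non-empty separator (the `sep=None` whitespace mode is not covered), keeps an index into the original string and calls `str.find` with a start offset instead. The name `isplit_find` is hypothetical.

```python
def isplit_find(s, sep):
    """Yield the same pieces as s.split(sep), one at a time.

    Hypothetical helper: requires a non-empty string separator.
    """
    if not sep:
        raise ValueError("empty separator")
    start = 0
    while True:
        # Search from the current offset; the string itself is never copied.
        end = s.find(sep, start)
        if end == -1:
            # No more separators: the rest of the string is the last piece.
            yield s[start:]
            return
        yield s[start:end]
        start = end + len(sep)


pieces = list(isplit_find('abc<def<>ghi<><>lmn', '<>'))
```

Only the yielded pieces are allocated, so the scanning work stays proportional to the length of the string.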

Ivo van der Wijk
This is O(n^2), making it catastrophically slow when the string has a lot of words, eg. `"abcd " * 1000000`. (I explained this already to someone else who gave the same solution--he deleted the answer, so now I get to repeat myself...)
Glenn Maynard
@Glenn: While it's a pity that such clear code doesn't have good complexity, I'd think that for strings of typical length it would do just fine. What is the length of strings you're usually splitting?
SilentGhost
Also, you could improve performance and code by using `partition` (which doesn't allow for `None` separator): `while s: a, _, s = s.partition(sep);yield a`
SilentGhost
@SilentGhost: While I'm not sure what real-world use this has to begin with, it's even harder to think of a practical use for this if you're not dealing with very large strings.
Glenn Maynard
A: 

This is a generator version of split() implemented via re.search() that does not have the problem of allocating too many substrings.

import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')

EDIT: Corrected handling of surrounding whitespace if no separator chars are given.
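
For the default whitespace-separated case only, a shorter sketch is possible: `re.finditer` is itself lazy, so matching runs of non-whitespace (`\S+`) yields the words without tracking positions by hand. The helper name `iterwords` is hypothetical, and this variant does not handle an explicit `sep`.

```python
import itertools
import re

_WORD = re.compile(r'\S+')


def iterwords(s):
    """Lazily yield the whitespace-separated words of s (hypothetical helper)."""
    for m in _WORD.finditer(s):
        yield m.group()


# Early exit: islice stops the generator after the first two words.
first_two = list(itertools.islice(iterwords(" Good evening, world! "), 2))
```

Because matching happens on demand, stopping early with `itertools.islice` means the rest of the string is never scanned.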

Bernd Petersohn