ansaurus

Question

Is there a generator version of `string.split()` in Python?

Answer 1

+1 A:

No, but it should be easy enough to write one using itertools.takewhile().

EDIT:

Very simple, half-broken implementation:

import itertools
import string

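def isplitwords(s):
  i = iter(s)
  while True:
    r = []
    # takewhile consumes characters up to and including the next whitespace.
    for c in itertools.takewhile(lambda x: x not in string.whitespace, i):
      r.append(c)
    if r:
      yield ''.join(r)
    else:
      # Empty word: the input is exhausted, or a whitespace character was hit
      # immediately (leading or repeated whitespace), so iteration stops early.
      # A plain return ends the generator; raising StopIteration here
      # is a RuntimeError since Python 3.7 (PEP 479).
      return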
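
For comparison, here is a minimal sketch of a whitespace splitter that does not stop at repeated or leading whitespace. It swaps `itertools.takewhile` for `itertools.groupby`, so it is a different technique from the one above, and the name `isplit_ws` is made up for illustration.

```python
import itertools

def isplit_ws(s):
    """Lazily yield whitespace-separated words from s (hypothetical helper)."""
    # groupby buckets consecutive characters by whether they are whitespace,
    # so a run of several separators forms a single group that is skipped.
    for is_space, chars in itertools.groupby(s, key=str.isspace):
        if not is_space:
            yield ''.join(chars)

words = list(isplit_ws("  Good  evening,\tworld! "))
```

Like `str.split()` with no arguments, this ignores leading and trailing whitespace and never yields empty strings. It still builds each word character by character in Python, so the performance caveat raised in the comments applies here too.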

Ignacio Vazquez-Abrams 2010-10-05 08:33:22

@Ignacio: The example in the docs uses a list of integers to illustrate the use of `takewhile`. What would be a good `predicate` for splitting a string into words (default `split`) using `takewhile()`?

Manoj Govindan 2010-10-05 08:36:00

Look for presence in `string.whitespace`.

Ignacio Vazquez-Abrams 2010-10-05 08:37:25

The separator can have multiple characters, `'abc<def<>ghi<><>lmn'.split('<>') == ['abc<def', 'ghi', '', 'lmn']`

KennyTM 2010-10-05 08:42:06

@Ignacio: Can you add an example to your answer?

Manoj Govindan 2010-10-05 08:43:41

Easy to write, but *many* orders of magnitude slower. This is an operation that really should be implemented in native code.

Glenn Maynard 2010-10-05 08:43:47

@KennyTM: Sure, it *can* be. But it doesn't always need to be, and it usually is not.

Ignacio Vazquez-Abrams 2010-10-05 08:44:08

@Glenn: Is string type's `split` implemented in native code? I checked `string.split` and found it dispatches to `s.split` where `s` is the first argument to `string.split`.

Manoj Govindan 2010-10-05 09:09:57

@Manoj: `str` and `unicode` are implemented in native code, so yes.

Ignacio Vazquez-Abrams 2010-10-05 09:13:28

@Ignacio: Got it. Is a native generator version possible at all?

Manoj Govindan 2010-10-05 09:27:35

Probably. You may need to implement a new type for the generator and fill its `tp_iternext` member, but I don't know all the details.

Ignacio Vazquez-Abrams 2010-10-05 09:33:17

It's a lot more work, and I doubt the value of this to begin with, but anything you can do in Python you can do natively if you really want to.

Glenn Maynard 2010-10-05 09:52:03

Answer 2

+2 A:

I don't see any obvious benefit to a generator version of split(). The generator object is going to have to contain the whole string to iterate over so you're not going to save any memory by having a generator.

If you wanted to write one it would be fairly easy though:

import string

def gsplit(s,sep=string.whitespace):
    word = []

    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)

    if word:
        yield "".join(word)

Dave Webb 2010-10-05 08:53:00

You'd halve the memory used, by not having to store a second copy of the string in each resulting part, plus the array and object overhead (which is typically more than the strings themselves). That generally doesn't matter, though (if you're splitting strings so large that this matters, you're probably doing something wrong), and even a native C generator implementation would always be significantly slower than doing it all at once.

Glenn Maynard 2010-10-05 08:58:54

@Glenn Maynard - I just realised that. For some reason I originally thought the generator would store a copy of the string rather than a reference. A quick check with `id()` put me right. And obviously, as strings are immutable, you don't need to worry about someone changing the original string while you're iterating over it.

Dave Webb 2010-10-05 09:02:28

Isn't the main point in using a generator not the memory usage, but that you could save yourself having to split the whole string if you wanted to exit early? (That's not a comment on your particular solution, I was just surprised by the discussion about memory).

Scott Griffiths 2010-10-05 16:15:47

@Scott: It's hard to think of a case where that's really a win--where 1: you want to stop splitting partway through, 2: you don't know how many words you're splitting in advance, 3: you have a large enough string for it to matter, and 4: you consistently stop early enough for it to be a significant win over str.split. That's a very narrow set of conditions.

Glenn Maynard 2010-10-05 20:35:46

Answer 3

A:

You can build one easily using str.split itself with a limit:

def isplit(s, sep=None):
    while s:
        parts = s.split(sep, 1)
        if not parts:
            # Only whitespace left (possible when sep is None): nothing to yield.
            return
        if len(parts) == 2:
            s = parts[1]
        else:
            s = ''
        yield parts[0]

This way, you don't have to replicate split()'s functionality and behaviour (e.g. when sep=None), and it relies on str.split's presumably fast native implementation. I assume that str.split will stop scanning the string for separators once it has enough 'parts'.

As Glenn Maynard points out, this scales poorly for large strings (O(n^2)), because every iteration copies the remainder of the string. I've confirmed this through `timeit` tests.
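
One way to avoid the quadratic cost, sketched here under the assumption of an explicit, non-empty separator, is to keep an index into the original string and search forward with `str.find` instead of re-slicing the remainder. The name `isplit_find` is hypothetical, and this sketch does not cover the `sep=None` whitespace behaviour.

```python
def isplit_find(s, sep):
    """Lazily yield the fields of s separated by sep (hypothetical helper)."""
    if not sep:
        # Mirror str.split, which rejects an empty separator.
        raise ValueError("empty separator")
    pos = 0
    while True:
        idx = s.find(sep, pos)  # search from pos; no copy of the remainder
        if idx == -1:
            yield s[pos:]  # last field (may be empty)
            return
        yield s[pos:idx]
        pos = idx + len(sep)

fields = list(isplit_find('abc<def<>ghi<><>lmn', '<>'))
```

Each character is scanned once and copied once into its field, so the total work is linear in the length of the string. Empty fields are kept, as with `str.split(sep)`.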

Ivo van der Wijk 2010-10-05 10:33:51

This is O(n^2), making it catastrophically slow when the string has a lot of words, e.g. `"abcd " * 1000000`. (I explained this already to someone else who gave the same solution--he deleted the answer, so now I get to repeat myself...)

Glenn Maynard 2010-10-05 11:30:51

@Glenn: while it's a pity that such clear code doesn't have good complexity, I'd think that for strings of typical length it would do just fine. What is the length of the strings you're usually splitting?

SilentGhost 2010-10-05 12:03:42

Also, you could improve performance and code by using `partition` (which doesn't allow for `None` separator): `while s: a, _, s = s.partition(sep);yield a`

SilentGhost 2010-10-05 12:12:06

@SilentGhost: While I'm not sure what real-world use this has to begin with, it's even harder to think of a practical use for this if you're not dealing with very large strings.

Glenn Maynard 2010-10-05 20:39:29

Answer 4

A:

This is a generator version of split() implemented via re.search() that does not have the problem of allocating too many substrings.

import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')

EDIT: Corrected handling of surrounding whitespace if no separator chars are given.
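
For the whitespace-only case, a shorter sketch is possible with `re.finditer`, which is itself lazy. Combined with `itertools.islice` it shows the early-exit use discussed in the comments: only as much of the string is scanned as is needed for the requested words. The name `iterwords` is made up for illustration, and this variant does not handle an explicit separator.

```python
import itertools
import re

def iterwords(s):
    """Lazily yield maximal runs of non-whitespace characters (hypothetical helper)."""
    for m in re.finditer(r'\S+', s):
        yield m.group(0)

text = " Good evening, world! "
# Take only the first two words; the rest of the string is never scanned.
first_two = list(itertools.islice(iterwords(text), 2))
```

Here `first_two` is `['Good', 'evening,']`, and consuming the whole generator gives the same list as `text.split()` for this sample.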

Bernd Petersohn 2010-10-05 15:47:59
