views: 1151
answers: 4
I'm writing a Python function to split text into words while ignoring specified punctuation. Here is some working code. I'm not convinced that building strings out of lists (buf = [] in the code) is efficient, though. Does anyone have a suggestion for a better way to do this?

def getwords(text, splitchars=' \t|!?.;:"'):
    """
    Generator to get words in text by splitting text along specified splitchars
    and stripping out the splitchars::

      >>> list(getwords('this is some text.'))
      ['this', 'is', 'some', 'text']
      >>> list(getwords('and/or'))
      ['and', 'or']
      >>> list(getwords('one||two'))
      ['one', 'two']
      >>> list(getwords(u'hola unicode!'))
      [u'hola', u'unicode']
    """
    splitchars = set(splitchars)
    buf = []
    for char in text:
        if char not in splitchars:
            buf.append(char)
        else:
            if buf:
                yield ''.join(buf)
                buf = []
    # All done. Yield last word.
    if buf:
        yield ''.join(buf)
+4  A: 

You don't want to use re.split?

>>> import re
>>> re.split("[,; ]+", "coucou1 ,   coucou2;coucou3")
['coucou1', 'coucou2', 'coucou3']
poulejapon
Didn't think of that at all. Will consider it. Thanks!
Jace
+3  A: 

http://www.skymind.com/~ocrow/python_string/ talks about several ways of concatenating strings in Python and assesses their performance as well.

Vijay Dev
This was what I needed. Thanks. cStringIO appears the best choice for my use case.
Jace
Uh oh. cStringIO can't handle unicode strings.
Jace
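For what it's worth, plain ''.join() copes with unicode just fine, which sidesteps the cStringIO limitation; a minimal check:

```python
# Quick check that str.join handles unicode text, unlike the old
# Python 2 cStringIO module mentioned above.
parts = [u'hola', u' ', u'unicode']
joined = u''.join(parts)
assert joined == u'hola unicode'
```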
For what it's worth: I hacked on that testcase until it ran on my Python 2.5 install, and found Method 6 (feed ''.join a list comprehension) to be consistently fastest. 6 with generator expressions turned out *slower* but still second-fastest.
kquinn
In order from fastest to slowest, the methods ended up being 6, 7, 4, 1, 5, 3, 2. (7 is 6 with the brackets dropped to make it a generator expression not list comprehension). I was unable to measure memory use.
kquinn
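A minimal sketch of the comparison kquinn describes, timing ''.join over a list comprehension versus a generator expression (method numbering as in the linked page):

```python
import timeit

chars = [chr(65 + (i % 26)) for i in range(1000)]

# Method 6: ''.join over a list comprehension.
t_list = timeit.timeit(lambda: ''.join([c for c in chars]), number=1000)
# Method 7: the same with a generator expression.
t_gen = timeit.timeit(lambda: ''.join(c for c in chars), number=1000)

# Both build the same string; the list version tends to win because
# str.join can size the result in one pass over a list.
assert ''.join([c for c in chars]) == ''.join(c for c in chars)
```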
+1  A: 

You can split the input using re.split():

>>> splitchars=' \t|!?.;:"'
>>> re.split("[%s]" % splitchars, "one\ttwo|three?four")
['one', 'two', 'three', 'four']
>>>

EDIT: If your splitchars may contain regex-special characters like ] or ^, you can use re.escape():

>>> re.escape(splitchars)
'\\ \\\t\\|\\!\\?\\.\\;\\:\\"'
>>> re.split("[%s]" % re.escape(splitchars), "one\ttwo|three?four")
['one', 'two', 'three', 'four']
>>>
gimel
That one's risky. What if splitchars starts with a '^' or contains a ']'?
Jace
Escape them. See edit.
gimel
+2  A: 

You can use re.split

re.split(r'[\s|!?.;:"]', text)

However, if the text is very large, the resulting list may consume too much memory. In that case, consider re.finditer:

import re

def getwords(text, splitchars=' \t|!?.;:"'):
    # Negated character class: runs of one or more non-separator chars.
    # re.escape() keeps characters like ] or ^ from breaking the class.
    # (Prefixing each char with "^" as a negator only works for the first
    # one; the rest would be treated as literal '^' characters.)
    pattern = "[^%s]+" % re.escape(splitchars)
    for match in re.finditer(pattern, text):
        yield match.group()

# a quick test
s = "a:b cc? def...a||"
words = list(getwords(s))
assert ["a", "b", "cc", "def", "a"] == words, words
Jiayao Yu