ansaurus

Question

Splitting a string into words and punctuation

Answer 1

A:

Have you tried using a regex?

http://docs.python.org/library/re.html#re-syntax

By the way, why do you need the "," in the second one? You already know that one follows each piece of text, i.e.

[0]

","

[1]

","

So if you want to add the ",", you can just do it after each iteration when you use the array.

Filip Ekberg 2008-12-14 23:34:49

Answer 2

A:

Thanks Filip, I'll look into the regex library.

On the second point, maybe I should have used a better example.

If I had the string:

d = "Hello, I'm a string!"

I'd want the output:

['Hello', ',', "I'm", 'a', 'string', '!']

So, I want to separate all different sorts of punctuation from the alphanumerics, splitting at the whitespace.

David A 2008-12-14 23:56:34

please edit your original question if you have additional information

hop 2008-12-15 00:43:03

Answer 3

A:

In Perl-style regular expression syntax, \b matches a word boundary. This should come in handy for a regex-based split.

edit: I have been informed by hop that "empty matches" do not work in the split function of Python's re module. I will leave this here as information for anyone else getting stumped by this "feature".

Svante 2008-12-15 00:25:08

only it doesn't because re.split will not work with r'\b'...

hop 2008-12-15 01:09:10

What the hell? Is that a bug in re.split? In Perl, `split /\b\s*/` works without any problem.

Svante 2008-12-15 01:29:34

it's kind of documented that re.split() won't split on empty matches... so, no, not /really/ a bug.

hop 2008-12-15 01:51:26

"kind of documented"? Even if it is really documented, it is still not helpful in any way, so I guess it is, in fact, a bug-redeclared-feature.

Svante 2008-12-15 02:08:28

maybe. i don't know the rationale behind it. you should have checked whether it worked in any case! i cannot remove the downvote anymore, but please consider rewording the passive-aggressive edit -- doesn't help anyone.

hop 2008-12-15 09:16:08
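For readers on a newer interpreter: since Python 3.7, re.split() does split on empty matches, so the \b idea from this answer can be tried directly. The following is only a sketch under that version assumption; the function name split_on_word_boundaries is made up for illustration.

```python
import re

def split_on_word_boundaries(text):
    """Split text at every word boundary and drop whitespace-only pieces.

    Assumes Python 3.7 or later; older versions of re.split() did not
    split on empty matches such as \\b.
    """
    pieces = re.split(r"\b", text)
    # Strip the whitespace that clings to punctuation (e.g. ", ") and
    # discard pieces that are empty or were only whitespace.
    return [piece.strip() for piece in pieces if piece.strip()]

tokens = split_on_word_boundaries("Hello, I'm a string!")
print(tokens)
```

Note that a word boundary also exists on both sides of an apostrophe, so "I'm" comes out as three tokens (I, ', m) rather than the single token the question asks for.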

Answer 4

A:

I think you can find all the help you can imagine in the NLTK, especially since you are using Python. There's a good, comprehensive discussion of this issue in the tutorial.

le dorfier 2008-12-15 00:34:08

Answer 5

A:

Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into the NLTK that le dorfier suggested.

This might be only a little faster, since ''.join() is used in place of +=, which is generally considered the faster way to build strings.

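import string

d = "Hello, I'm a string!"

result = []
word = ''

for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            # Anything that is not a letter or apostrophe: flush the
            # pending word, then emit the character as its own token.
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            word = ''.join([word, char])
    else:
        # Whitespace ends the current word.
        if word:
            result.append(word)
            word = ''

# Flush a trailing word when the string ends with a letter.
if word:
    result.append(word)

print(result)
# ['Hello', ',', "I'm", 'a', 'string', '!']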

monkut 2008-12-15 01:05:11

I have not profiled this, but I guess the main problem is the char-by-char concatenation of word. I'd use an index and slices instead.

hop 2008-12-15 10:24:20

With tricks i can shave 50% off the execution time of your solution. my solution with re.findall() is still twice as fast.

hop 2008-12-15 12:17:54
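The index-and-slices idea from the comment above could look roughly like this. It is a sketch, not hop's actual code (which was not posted); the names WORD_CHARS and split_words_and_punctuation are made up for illustration. It keeps the same rules as the loop in this answer: letters and apostrophes form words, whitespace separates, and every other character is its own token.

```python
import string

# Characters that may appear inside a word (same set as the answer's loop).
WORD_CHARS = set(string.ascii_letters + "'")

def split_words_and_punctuation(text):
    """Tokenize text by remembering where each word starts and slicing it out."""
    tokens = []
    start = None  # index where the current word began, or None if not in a word
    for i, char in enumerate(text):
        if char in WORD_CHARS:
            if start is None:
                start = i
            continue
        # A non-word character ends any word in progress.
        if start is not None:
            tokens.append(text[start:i])
            start = None
        # Whitespace is dropped; everything else becomes a token.
        if char not in string.whitespace:
            tokens.append(char)
    # Flush a word that runs to the end of the string.
    if start is not None:
        tokens.append(text[start:])
    return tokens

print(split_words_and_punctuation("Hello, I'm a string!"))
```

Each word is copied once, when it is sliced out, instead of being rebuilt one character at a time.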

Answer 6

A:

Here's my entry.

I have my doubts as to how well this will hold up in the sense of efficiency, or if it catches all cases (note the "!!!" grouped together; this may or may not be a good thing).

>>> import re
>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"
>>> l = [item.strip() for item in re.split(r"(\W+)", s) if item.strip()]
>>> l
['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']
>>>

One obvious optimization would be to compile the regex beforehand (using re.compile) if you're going to be doing this on a line-by-line basis.

Chris Cameron 2008-12-15 01:30:32
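A minimal sketch of the precompiled variant mentioned in this answer, assuming the input arrives as an iterable of lines. The names SPLIT_RE, tokenize_line and the sample lines are made up for illustration.

```python
import re

# Compiled once and reused for every line. The capturing group makes
# split() return the separators (punctuation and whitespace) as well.
SPLIT_RE = re.compile(r"(\W+)")

def tokenize_line(line):
    """Split one line into words and punctuation runs, dropping whitespace."""
    return [part.strip() for part in SPLIT_RE.split(line) if part.strip()]

lines = [
    "Helo, my name is Joe!",
    "and i live!!! in a button; factory:",
]
tokenized = [tokenize_line(line) for line in lines]
for tokens in tokenized:
    print(tokens)
```

As in the interactive session above, consecutive punctuation such as "!!!" stays grouped as one token.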

Answer 7

+6 A:

This is more or less the way to do it:

>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']

The trick is not to think about where to split the string, but about what to include in the tokens.

Caveats:

The underscore (_) is considered an inner-word character. Replace \w, if you don't want that.
This will not work with (single) quotes in the string.
Put any additional punctuation marks you want to use in the right half of the regular expression.
Anything not explicitly mentioned in the regular expression is silently dropped.
Compiling the regular expression beforehand should make this considerably faster.

hop 2008-12-15 01:53:18

Thanks, works perfectly.

David A 2008-12-15 20:42:08
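One possible way to act on the caveats listed in this answer, offered as a sketch rather than as hop's method: the pattern is compiled once, the underscore is excluded from words, an apostrophe counts as part of a word only when it sits between word characters, and any other non-whitespace character becomes its own token instead of being dropped. TOKEN_RE and tokenize are made-up names.

```python
import re

# First alternative: a run of letters/digits (no underscore), optionally
# continued by apostrophe-joined runs, so "I'm" stays in one piece.
# Second alternative: any single non-whitespace character that is left
# over, so no punctuation is silently discarded.
TOKEN_RE = re.compile(r"[^\W_]+(?:'[^\W_]+)*|\S")

def tokenize(text):
    """Return the words and punctuation marks found in text."""
    return TOKEN_RE.findall(text)

print(tokenize("Hello, I'm a string!"))
print(tokenize("'quoted' snake_case"))
```

With this pattern, surrounding single quotes are emitted as separate tokens, and "snake_case" is split around the underscore. Whether that is desirable depends on the text being processed.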
