views: 4391
answers: 8

Weird - I think what I want to do is a fairly common task, but I've found no reference on the web. I have text with punctuation, and I want an array of the words. E.g. "Hey, you - what are you doing here!?" should become ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']. But Python's split() only works with one argument, so after splitting on whitespace I'm left with words that still have punctuation attached. Any ideas?

Thanks

+14  A: 

A case where regular expressions are justified:

import re
DATA = "Hey, you - what are you doing here!?"
print re.findall(r'\w+', DATA)
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
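
The question's expected output is lowercased; if that matters, a small addition (not part of the original answer) maps lower() over the matches:

print [w.lower() for w in re.findall(r'\w+', DATA)]
# Prints ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']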
RichieHindle
Oo, nice approach.
Paolo Bergantino
Thanks. Still interested, though - how can I implement the algorithm used in this module? And why does it not appear in the string module?
ooboo
I don't know why the string module doesn't have a multi-character split. Maybe it's considered complex enough to be in the realm of regular expressions. As for "how can I implement the algorithm", I'm not sure what you mean... it's there in the re module - just use it.
RichieHindle
No, I mean - how does this module work? It's not straightforward at all.
ooboo
Regular expressions can be daunting at first, but are very powerful. The regular expression '\w+' means "a word character (a-z etc.) repeated one or more times". There's a HOWTO on Python regular expressions here: http://www.amk.ca/python/howto/regex/
RichieHindle
I got that - I don't mean how to use the re module (it's pretty complicated in itself), but how is it implemented? split() is rather straightforward to program manually; this is much more difficult...
ooboo
You want to know how the re module itself works? I can't help you with that I'm afraid - I've never looked at its innards, and my Computer Science degree was a very long time ago. 8-)
RichieHindle
I'm only doing CS1, so I've got a long way to go... At first glance it actually seems very difficult - harder than TSP etc. :)
ooboo
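
For the curious, here is a minimal hand-rolled sketch of a multi-separator split (an illustration only - the re module itself compiles patterns into a matching engine written in C, which is far more involved):

def split_on_many(text, separators):
    # Replace every separator character with a space, then split
    # on whitespace; naive but easy to follow
    for sep in separators:
        text = text.replace(sep, ' ')
    return text.split()

print split_on_many("Hey, you - what are you doing here!?", ',-!?')
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']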
A: 

You want Python's RegEx module's findall() method:

http://www.regular-expressions.info/python.html


Tyson
Python's "RegEx" module? Python once had a "regex" module which was provided but deprecated up until 2.5 when it vanished.
John Machin
"re"? As in the one included in 2.6.2? http://docs.python.org/library/re.html The one included in the bleeding edge 3.2a0? http://docs.python.org/dev/py3k/library/re.html Something tells me it's not deprecated and is, in fact, the definitive RegEx module for Python.
Tyson
+8  A: 

re.split()

re.split(pattern, string[, maxsplit=0])

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. (Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.)

>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
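
Applied to the question's string - note that \W+ leaves a trailing empty string when the text ends with punctuation, which you may want to filter out:

>>> [w for w in re.split(r'\W+', 'Hey, you - what are you doing here!?') if w]
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']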
gimel
+1  A: 

try this:

import re

phrase = "Hey, you - what are you doing here!?"
matches = re.findall(r'\w+', phrase)
print matches

this will print ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

Corey Goldberg
+1  A: 

Another way to achieve this is to use the Natural Language Toolkit (nltk).

import nltk
data= "Hey, you - what are you doing here!?"
word_tokens = nltk.tokenize.regexp_tokenize(data, r'\w+')
print word_tokens

This prints: ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

The biggest drawback of this method is that you need to install the nltk package.

The benefit is that you can do a lot of fun stuff with the rest of the nltk package once you have your tokens.
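
For example (a small sketch beyond the original answer - nltk's general-purpose tokenizer keeps punctuation as separate tokens instead of discarding it):

import nltk
data = "Hey, you - what are you doing here!?"
# word_tokenize may require the 'punkt' tokenizer data to be downloaded first
print nltk.word_tokenize(data)
# Prints something like ['Hey', ',', 'you', '-', 'what', 'are', 'you', 'doing', 'here', '!', '?']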

tgray
A: 

Use a list comprehension for this stuff... it seems easier.

data= "Hey, you - what are you doing here!?"
tokens = [c for c in data if c not in (',', ' ', '-', '!', '?')]

I find this easier to comprehend (read: maintain) than using a regexp, simply because I am not that good at regexps... which is the case for most of us :). Also, if you know what set of separators you might be using, you can keep them in a set. With a very large set this might be slower... but the 're' module is slow as well.

OK... this is WRONG! It works only if you want to get characters out, not words. My mistake.
For a list comprehension that yields words, see the sketch below, or use what ghostdog74 has given.
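
A list comprehension that does yield words (a sketch: split on whitespace first, then strip punctuation from the edges of each piece):

import string
data = "Hey, you - what are you doing here!?"
# Strip punctuation from each whitespace-separated piece and drop
# pieces that were pure punctuation (like the lone '-')
words = [w.strip(string.punctuation) for w in data.split()]
print [w for w in words if w]
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']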
+4  A: 

Another way, without regex:

import string

thestring = "Hey, you - what are you doing here!?"
# Drop every punctuation character, then split on whitespace
print ''.join(c for c in thestring if c not in string.punctuation).split()
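
Equivalently (a Python 2 sketch, not part of the original answer), str.translate can delete all the punctuation in one call:

import string
thestring = "Hey, you - what are you doing here!?"
# translate(None, deletechars) removes every character in deletechars
print thestring.translate(None, string.punctuation).split()
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']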
ghostdog74
A: 

Kinda late answer :), but I had a similar dilemma and didn't want to use the 're' module.

def my_split(s, seps):
    # Start with the whole string as one piece, then re-split the
    # pieces on each separator in turn
    res = [s]
    for sep in seps:
        s, res = res, []
        for seq in s:
            res += seq.split(sep)
    return res

print my_split('1111  2222 3333;4444,5555;6666', [' ', ';', ','])
['1111', '', '2222', '3333', '4444', '5555', '6666']
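
Note that consecutive separators produce empty strings (the '' after '1111' above). If you want them dropped, filter the result:

print [p for p in my_split('1111  2222 3333;4444,5555;6666', [' ', ';', ',']) if p]
# Prints ['1111', '2222', '3333', '4444', '5555', '6666']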
pprzemek