ansaurus

Question

Python regex string to list of words (including words with hyphens)

Answer 1

+1 A:

You could use "[^\w-]+" instead.

Jens 2010-08-04 14:58:19

That would still return `-this`, but I know of no better solution either. I suspect there is no way around a second pass over the result to remove the unwanted hyphens.

Aaron Digulla 2010-08-04 15:01:00
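The two-pass idea from the comment above can be sketched as follows. This is only an illustration, not code from either poster: the sample string is borrowed from a later answer, and the names `pieces` and `words` are made up for the example.

```python
import re

# Sample input borrowed from a later answer (an assumption, not the asker's data)
s = '-this is. A - sentence;one-word'

# First pass: split on runs of characters that are neither word characters nor hyphens
pieces = re.split(r'[^\w-]+', s)

# Second pass: strip stray hyphens from both ends and drop pieces that end up empty
words = [w for w in (p.strip('-') for p in pieces) if w]
print(words)
```

The first pass still leaves `-this` and a lone `-` in `pieces`; the second pass is what cleans them up.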

Answer 2

A:

You can try the NLTK library:

>>> import nltk
>>> s = '-this is a - sentence;one-word'
>>> hyphen = r'(?:\w+\-\s?\w+)'   # hyphenated word; non-capturing so findall returns strings
>>> wordr = r'(?:\w+)'            # plain word
>>> r = "|".join([hyphen, wordr])
>>> tokens = nltk.tokenize.regexp_tokenize(s, r)
>>> print(tokens)
['this', 'is', 'a', 'sentence', 'one-word']

I found it here: http://www.cs.oberlin.edu/~jdonalds/333/lecture03.html. Hope it helps.

fsouto 2010-08-04 15:02:35

Answer 3

+1 A:

If you don't need the leading empty string, you could match the words directly with the pattern `\w(?:[-\w]*\w)?`:

>>> import re
>>> s = '-this is. A - sentence;one-word'
>>> rx = re.compile(r'\w(?:[-\w]*\w)?')
>>> rx.findall(s)
['this', 'is', 'A', 'sentence', 'one-word']

Note that it won't keep words with apostrophes, such as won't, in one piece.

KennyTM 2010-08-04 15:15:00
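If apostrophes matter, one possible variation on the pattern above is to allow `'` inside the word as well. This is a sketch under that assumption, not part of the original answer; the name `rx_apos` and the extra sample word are hypothetical.

```python
import re

# Same shape as the pattern above, with "'" also allowed between the first
# and last word characters (an assumed extension, not the answerer's code)
rx_apos = re.compile(r"\w(?:[-'\w]*\w)?")

s = "-this is. A - sentence;one-word won't"
words = rx_apos.findall(s)
print(words)
```

Because a match must still start and end on a word character, leading or trailing hyphens and apostrophes are left out of the tokens.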

Answer 4

+1 A:

import re
s = "-this is. A - sentence;one-word what's"
re.findall(r"\w+-\w+|[\w']+", s)

result: ['this', 'is', 'A', 'sentence', 'one-word', "what's"]

Note that the order of the alternatives matters: the hyphenated-word pattern has to come first!

pyInTheSky 2010-08-04 16:50:39
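To see why the order of the alternatives matters, here is a small comparison of both orderings on the same input. It is only an illustration; the variable names `hyphen_first` and `hyphen_last` are made up for the example.

```python
import re

s = "-this is. A - sentence;one-word what's"

# Hyphenated alternative first: "one-word" is matched as a single token
hyphen_first = re.findall(r"\w+-\w+|[\w']+", s)

# Plain-word alternative first: it wins at "one", so the hyphenated
# alternative never gets a chance and "one-word" is split in two
hyphen_last = re.findall(r"[\w']+|\w+-\w+", s)

print(hyphen_first)
print(hyphen_last)
```

Python's `re` module tries alternatives left to right and takes the first one that succeeds, so the more specific pattern should be listed first.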

Answer 5

A:

Here is my traditional "why use a regexp language when you can use Python" alternative:

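import string
s = "-this is. A - sentence;one-word what's"
# Put a space after ';' so split() separates the words,
# then strip punctuation from both ends of each piece
words = [word.strip(string.punctuation)
         for word in s.replace(';', '; ').split()]
words = [word for word in words if word]  # drop pieces that were only punctuation
print(words)
""" Output:
['this', 'is', 'A', 'sentence', 'one-word', "what's"]
"""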

Tony Veijalainen 2010-08-04 19:33:49
