This may be a silly question but...

Say you have a sentence like:

The quick brown fox

Or you might get a sentence like:

The quick brown fox jumped over the lazy dog

The simple regexp (\w*) finds the first word "The" and puts it in a group.

For the first sentence, you could write (\w*)\s*(\w*)\s*(\w*)\s*(\w*)\s* to put each word in its own group, but that assumes you know the number of words in the sentence.

Is it possible to write a regular expression that puts each word in any arbitrary sentence into its own group? It would be nice if you could do something like (?:(\w*)\s*)* to have it group each instance of (\w*), but that doesn't work.

I am doing this in Python, and my use case is obviously a little more complex than "The quick brown fox", so it would be nifty if Regex could do this in one line, but if that's not possible then I assume the next best solution is to loop over all the matches using re.findall() or something similar.

Thanks for any insight you may have.

Edit: For completeness's sake here's my actual use case and how I solved it using your help. Thanks again.

>>> import re
>>> s = '1 0 5 test1 5 test2 5 test3 5 test4 5 test5'
>>> s = re.match(r'^\d+\s\d+\s?(.*)', s).group(1)
>>> print(s)
5 test1 5 test2 5 test3 5 test4 5 test5
>>> words = re.findall(r'\d+\s(\w+)', s)
>>> print(words)
['test1', 'test2', 'test3', 'test4', 'test5']
+3  A: 

Why use a regex when string.split does the same thing?

>>> "The quick brown fox".split()
['The', 'quick', 'brown', 'fox']
Mark Rushakoff
Mainly because my use case is slightly more complex, and it seems a regex would be the best fit for it. What I'm actually trying to do is get each instance of test1, test2, test3, etc. out of a string like this: "1 0 5 test1 5 test2 5 test3 5 test4 5 test5", where ("x testn") can be repeated any number of times, "x" is the number of characters in "testn", and the "1 0 " at the front is useless junk.
blah238
+1  A: 

Regular expressions can't capture into an unknown number of groups, but there is hope in your case: look into the 'split' method, which should do what you need.

Vlad
+3  A: 

I don't believe that it is possible. A regex pairs its captures with the parentheses in the pattern. If you list only one group, as in '((\w+)\s+){0,99}', each repetition captures into the same first and second groups again, rather than creating new groups for each match found.
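To see the overwriting in action, here is a minimal sketch. It uses `\w+` rather than the question's `\w*` so that every repetition consumes at least one character; the variable names are only illustrative.

```python
import re

sentence = "The quick brown fox"

# One capturing group inside a repeated non-capturing group:
# each repetition overwrites what the group captured before.
m = re.match(r'(?:(\w+)\s*)*', sentence)
last_only = m.groups()

# findall, by contrast, reports the group's contents for every match.
all_words = re.findall(r'(\w+)', sentence)

print(last_only)   # ('fox',)
print(all_words)   # ['The', 'quick', 'brown', 'fox']
```

The whole sentence is matched, but only the last word is still available through the group afterwards.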

You could use split. Called with no arguments it already splits on runs of whitespace, but given an explicit separator it splits only on that exact string, not on a class of characters.

If you need to split on a pattern, you can use re.split, which splits on a regular expression; give it '\s' to match any whitespace character. You probably want '\s+' so that a run of whitespace is consumed greedily as a single separator.

>>> import re
>>> help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.

>>> re.split(r'\s+', 'The   quick brown\t fox')
['The', 'quick', 'brown', 'fox']
Mark Santesson
Thanks, that is more or less what I had concluded as well.
blah238
+4  A: 

You can also use the findall function in the re module:

>>> import re
>>> re.findall(r"\w+", "The quick brown fox")
['The', 'quick', 'brown', 'fox']
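
If you would rather loop over the matches, as the question suggests, re.finditer yields one match object per word. This is only a sketch of that variation, in case you need each word's position as well as its text; the variable names are made up for illustration.

```python
import re

sentence = "The quick brown fox"

# finditer yields a match object for every non-overlapping match,
# so both the matched text and its start offset are available.
words_with_pos = [(m.group(), m.start()) for m in re.finditer(r"\w+", sentence)]

print(words_with_pos)
# [('The', 0), ('quick', 4), ('brown', 10), ('fox', 16)]
```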
razpeitia