i'm working on a regex that will extract retweet keywords and user names from tweets. here's an example, with a rather terrible regex to do the job:

import re
tweet='foobar RT@one, @two: @three barfoo'
m=re.search(r'(RT|retweet|from|via)\b\W*@(\w+)\b\W*@(\w+)\b\W*@(\w+)\b\W*',tweet)
m.groups()
('RT', 'one', 'two', 'three')

what i'd like is to condense the repeated \b\W*@(\w+)\b\W* patterns into one that repeats a variable number of times, so that if @four were added after @three, it would also be extracted. i've tried many permutations to repeat this with a +, without success.

i'd also like this to work for something like

tweet='foobar RT@one, RT @two: RT @three barfoo'

which can be achieved with a re.finditer if the patterns don't overlap. (i have a version where the patterns do overlap, and so only the first RT gets picked up.)

any help is greatly appreciated. thanks.

+2  A: 

Try

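(RT|retweet|from|via)(?:\b\W*@(\w+))+

Enclosing the \b\W*@(\w+) in (?:...) allows you to group the terms for repetition without capturing the aggregate. Note that Python's re module keeps only the last match of a repeated capturing group, so the inner (\w+) ends up holding just the final handle.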

I'm not sure I'm following the second part of your question, but I think you may be looking for something involving a construct like:

(?:(?!RT|@).)

which will match any character that isn't an "@" or the start of "RT", again without capturing it.
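As a rough illustration of that construct, the sketch below repeats it with * and anchors it at the start of a string. The sample string and the name NOT_RT_OR_AT are only assumptions for the demo, not part of the original suggestion.

```python
import re

# Consume one character at a time, stopping right before "RT" or "@".
# NOT_RT_OR_AT is a hypothetical name chosen for this example.
NOT_RT_OR_AT = re.compile(r'(?:(?!RT|@).)*')

prefix = NOT_RT_OR_AT.match('foobar RT@one, @two').group()
print(repr(prefix))  # 'foobar '
```

The negative lookahead is checked before every single character, which is why the match ends just ahead of the "RT".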

In that case, how about:

(RT|retweet|from|via)((?:\b\W*@\w+)+)

and then post-process the second group

re.findall(r'@(\w+)', m.group(2))

to get the individual handles?
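Put together, the two-step idea might look like the sketch below. It assumes re.finditer for the outer pattern, so that several keywords in one tweet are each picked up, and re.findall for the handles (re.split with a capturing group would also return the text between the handles). The function name extract_retweets is made up for this example.

```python
import re

# Group 1 is the keyword, group 2 is the whole run of @handles that follows it.
RETWEET = re.compile(r'(RT|retweet|from|via)((?:\b\W*@\w+)+)')

def extract_retweets(tweet):
    """Return a list of (keyword, [handles]) pairs, one per keyword found."""
    results = []
    for m in RETWEET.finditer(tweet):
        # Second pass: pull the individual handles out of the captured run.
        handles = re.findall(r'@(\w+)', m.group(2))
        results.append((m.group(1), handles))
    return results

print(extract_retweets('foobar RT@one, @two: @three barfoo'))
# [('RT', ['one', 'two', 'three'])]
print(extract_retweets('foobar RT@one, RT @two: RT @three barfoo'))
# [('RT', ['one']), ('RT', ['two']), ('RT', ['three'])]
```

This is still two regexes rather than one, but it handles a variable number of handles as well as the repeated-RT case from the question.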

MarkusQ
thanks for the quick reply! unfortunately that doesn't seem to work, unless i've mistyped something:

tweet='foobar RT@one, @two: @three barfoo'
m=re.search(r'(RT|retweet|from|via)(?:\b\W*@(\w+))+',tweet)
m.groups()
('RT', 'three')

but i will read up on (?:...). thanks.
jhofman
thanks markus. i essentially ended up going with a method similar to this, but was bothered by not being able to come up with a one-regex solution. appreciate it.
jhofman