I have a multi-line string like this:

"...Togo...Togo...Togo...ACTIVE..."

I want to capture two things: everything between the third 'Togo' and 'ACTIVE', and the remainder of the string from 'ACTIVE' onward. I have not been able to write a regular expression that does this. If I try something like

reg = "(Togo^[Togo]*?)(ACTIVE.*)"

nothing is captured (the first and last parentheses are needed for capturing groups).

+1  A: 
reg = "Togo.*Togo.*Togo(.*)ACTIVE"

Alternatively, if you want to match the string between the last occurrence of Togo and the following occurrence of ACTIVE, and the number of Togo occurrences is not necessarily three, try this:

reg = "Togo(([^T]|T[^o]|To[^g]|Tog[^o])*T?.?.?)ACTIVE"
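Since the question mentions a multi-line string, here is a minimal Python sketch of the first pattern above. It assumes Python's re module (the question does not name a language), and the sample text is made up for illustration. In that module "." does not match a newline unless re.DOTALL is passed, so the flag decides whether the pattern can span lines.

```python
import re

# Hypothetical sample input; the question only shows "...Togo...Togo...Togo...ACTIVE...".
text = "a Togo b\nTogo c Togo d\ne ACTIVE rest"

pattern = r"Togo.*Togo.*Togo(.*)ACTIVE"

# Without re.DOTALL, "." stops at newlines, so the pattern cannot span lines.
no_flag = re.search(pattern, text)

# With re.DOTALL, "." also matches "\n" and the match can cross line breaks.
with_flag = re.search(pattern, text, re.DOTALL)

print(no_flag)                   # None
print(repr(with_flag.group(1)))  # ' d\ne '
```

Other regex flavors spell this option differently, so check the equivalent "dot matches newline" setting for the engine actually in use.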
Igor ostrovsky
+1  A: 

This matches just the desired parts:

.*(Togo.*?)(ACTIVE.*)

The leading .* is greedy, so the following Togo matches at the last possible place. The captured part starts at the last Togo.

In your expression, ^[Togo]*? doesn't do what you intended. Outside a character class, ^ tries to match the beginning of a line, and [Togo] matches any single one of the characters T, o or g. Even [^Togo] wouldn't work, since it matches any single character that is not T, o or g rather than "anything other than the word Togo".
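To make these points concrete, here is a small sketch in Python (an assumption, since the question does not say which language is used). The sample text is invented, and re.DOTALL is added so that "." can cross the line breaks of a multi-line string.

```python
import re

# Hypothetical sample input with three "Togo" and one "ACTIVE", spread over lines.
text = "a Togo b\nTogo c Togo d\ne ACTIVE rest"

# The attempt from the question: "^" right after "Togo" cannot match here,
# so the whole pattern fails.
attempt = re.search(r"(Togo^[Togo]*?)(ACTIVE.*)", text, re.DOTALL)
print(attempt)  # None

# [Togo] is a set of single characters, not the word "Togo".
class_match = re.fullmatch(r"[Togo]+", "goTo")
print(class_match is not None)  # True

# [^Togo] matches any single character other than T, o or g.
negated = re.findall(r"[^Togo]", "Tokyo")
print(negated)  # ['k', 'y']

# The greedy leading ".*" pushes the match for "Togo" as far right as possible.
m = re.search(r".*(Togo.*?)(ACTIVE.*)", text, re.DOTALL)
print(repr(m.group(1)))  # 'Togo d\ne '
print(repr(m.group(2)))  # 'ACTIVE rest'
```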

sth
Duh... much simpler than my attempt.
Igor ostrovsky
In general this seems to be the best suggestion, but in my case it takes too much time. Still, I think this is the best approach if it's fast enough.
Tony
+1  A: 
"(Togo(?:(?!Togo).)*)(ACTIVE.*)"

The square brackets in your regex form a character class that matches any one of the characters 'T', 'o', or 'g'. Outside a character class, the caret ('^') matches the beginning of the input; it only inverts a character class when it is the first character inside the square brackets.

In my regex, after matching the word "Togo" I match one character at a time, but only after I check that it isn't the start of another instance of "Togo". (?!Togo) is called a negative lookahead.
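Here is a short sketch of the lookahead pattern in use, assuming Python's re module (the question does not name a language) and made-up sample strings. re.DOTALL is an addition on my part so that "." also matches the newlines in a multi-line input.

```python
import re

# Same pattern as above, compiled with DOTALL so "." can match "\n".
pattern = re.compile(r"(Togo(?:(?!Togo).)*)(ACTIVE.*)", re.DOTALL)

# Hypothetical multi-line input with three occurrences of "Togo".
text = "a Togo b\nTogo c Togo d\ne ACTIVE rest"
m = pattern.search(text)
print(repr(m.group(1)))  # 'Togo d\ne '
print(repr(m.group(2)))  # 'ACTIVE rest'

# The pattern does not depend on there being exactly three "Togo":
# group 1 always starts at the last "Togo" before "ACTIVE".
text4 = "Togo 1 Togo 2 Togo 3 Togo 4 ACTIVE x"
m4 = pattern.search(text4)
print(repr(m4.group(1)))  # 'Togo 4 '
print(repr(m4.group(2)))  # 'ACTIVE x'
```

Because the lookahead refuses to step over another "Togo", a match attempt that starts at an earlier "Togo" fails as soon as it reaches the next one, and the search moves on until it reaches the last "Togo" before "ACTIVE".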

Alan Moore