Hi all,

Suppose I have the following test string:

Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop

where _ means any characters, eg: StartaGetbbGetcccGetddddStopeeeeeStart....

What I want to extract is the last occurrence of the word Get within each pair of Start and Stop delimiters. The result here would be the three bolded Get occurrences below.

Start_Get_Get_**Get**_Stop_Start_Get_**Get**_Stop_Start_**Get**__Stop

I should specify that I'd like to do this using only regex and, as far as possible, in a single pass.

Any suggestions are welcome

Thanks

A: 

I would have done it in two passes. The first pass finds the word "Get", and the second pass counts the number of occurrences of it.
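For comparison, a two-step approach might look like the following minimal Python sketch. It reads the two passes loosely: first isolate each Start...Stop block, then locate the last Get inside it. The function name `last_get_positions` and the filler characters in the sample string are illustrative assumptions, not something given in the question.

```python
import re

def last_get_positions(text):
    """Return the start index of the last 'Get' in each Start...Stop block."""
    positions = []
    # Pass 1: isolate the text between each Start and the nearest following Stop.
    for block in re.finditer(r"Start(.*?)Stop", text, re.DOTALL):
        # Pass 2: find the last 'Get' inside that block, if there is one.
        offset = block.group(1).rfind("Get")
        if offset != -1:
            positions.append(block.start(1) + offset)
    return positions

# Sample modelled on the question's "StartaGetbbGetccc..." illustration (filler is assumed).
sample = "StartaGetbbGetcccGetddddStopeeeeeStartfGetggGethhStopiiStartjGetkkStop"
print(last_get_positions(sample))
```

This is not the single-pass regex the question asks for; it only serves as a baseline to check a single-pass pattern against.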

PolyThinker
Thanks, PolyThinker. I can handle it in two steps as you suggest, but I wonder whether it would be possible in a single pass...
Jerome
A: 
$ echo "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get__Stop" | awk -vRS="Stop" -F"_*" '{print $(NF-1)}'
Get
Get
Get
ghostdog74
Thanks, ghostdog, but I really need regex only...
Jerome
A: 

Something like this, maybe:

(?<=Start(?:.Get)*)Get(?=.Stop)

That requires variable-length lookbehind support, which not all regex engines support.
It could be made to have a max length, which a few more (but still not all) support, by changing the first * to {0,99} or similar.

Also, in the lookahead, the . should possibly be .+ or .{1,2}, depending on whether the double underscore is a typo.
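As one concrete case of an engine without variable-length lookbehind, Python's built-in `re` module rejects both the unbounded and the bounded form at compile time. The sketch below only checks whether each pattern compiles; the helper name `compiles` is an assumption made for illustration.

```python
import re

def compiles(pattern):
    """Return True if Python's built-in re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Unbounded variable-length lookbehind: rejected by re.
unbounded_ok = compiles(r"(?<=Start(?:.Get)*)Get(?=.Stop)")
# Bounded variable-length lookbehind ({0,99}): also rejected by re.
bounded_ok = compiles(r"(?<=Start(?:.Get){0,99})Get(?=.Stop)")
# A fixed-width lookbehind is accepted.
fixed_ok = compiles(r"(?<=Start.)Get(?=.Stop)")

print(unbounded_ok, bounded_ok, fixed_ok)
```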

Peter Boughton
AFAIK, the `{0,99}` trick only works in Java (i.e., it supports bounded variable-length lookbehind). But you're in luck: the OP is using .NET, one of the two flavors that support *unbounded* lookbehind (the other being JGSoft).
Alan Moore
+1  A: 

With Perl, I'd do:

my $test = "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop";
$test =~ s#(?<=Start_)((Get_)*)(Get)(?=_Stop)#$1<FOUND>$3</FOUND>#g;
print $test;

output:

Start_Get_Get_<FOUND>Get</FOUND>_Stop_Start_Get_<FOUND>Get</FOUND>_Stop_Start_<FOUND>Get</FOUND>_Stop

You should adapt this to your regex flavour.
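As an example of such an adaptation, here is a rough Python equivalent of the Perl substitution above. It assumes, as the Perl version does, that the separator is a single literal underscore rather than arbitrary text, and it uses a non-capturing group for the repeated `Get_` run, so the final Get becomes group 2 instead of group 3.

```python
import re

test = "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop"

# Keep the leading run of "Get_" (group 1) and tag the final Get (group 2).
# The lookbehind is fixed-width, so Python's re module accepts it.
pattern = re.compile(r"(?<=Start_)((?:Get_)*)(Get)(?=_Stop)")
tagged = pattern.sub(r"\1<FOUND>\2</FOUND>", test)
print(tagged)
```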

M42
+1  A: 
Get(?=(?:(?!Get|Start|Stop).)*Stop)

I'm assuming your Start and Stop delimiters will always be properly balanced and they can't be nested.
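To see the pattern at work outside .NET, here is a small Python sketch that applies it to a string shaped like the question's example. The filler characters in `sample` and the use of `re.DOTALL` (so the filler may span lines) are assumptions made for illustration.

```python
import re

# Match a Get that is followed, with no intervening Get, Start or Stop, by a Stop.
LAST_GET = re.compile(r"Get(?=(?:(?!Get|Start|Stop).)*Stop)", re.DOTALL)

sample = "StartaGetbbGetcccGetddddStopeeeeeStartfGetggGethhStopiiStartjGetkkStop"

# Start offsets of the matched Get words, one per Start...Stop block.
starts = [m.start() for m in LAST_GET.finditer(sample)]
print(starts)
```

Each reported offset is the last Get before the corresponding Stop, found in a single scan of the input.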

Alan Moore
That's exactly what I needed! Thanks, Alan Moore.
Jerome
Hi Alan, I've tried this variant of your solution: Get(?=(?:(?!Get).)*Stop) and it seems to work too. Why is the alternation (Get|Start|Stop) needed, since (assuming the delimiters are correctly balanced, as you mention) the only requirement is that no other Get appears between the matched Get and the following Stop?
Jerome
`Start` is to prevent matching a `Get` that's not between delimiters, like `Get_Start_Stop`. As for `Stop`, suppose there's a whole bunch of text after the last `Stop`. You don't want the `.*` to go all the way to the end, only to have to backtrack most of that distance to match the `Stop`. Lookaheads can be slippery; it's worth a little extra care to make sure they only look as far ahead as you need them to.
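The `Get_Start_Stop` case can be checked with a short Python sketch comparing the two variants. The variable names and the second, well-formed input are assumptions added for illustration.

```python
import re

full = re.compile(r"Get(?=(?:(?!Get|Start|Stop).)*Stop)")
simplified = re.compile(r"Get(?=(?:(?!Get).)*Stop)")

# A Get that sits outside the delimiters.
outside = "Get_Start_Stop"
full_outside = full.findall(outside)              # the full pattern rejects it
simplified_outside = simplified.findall(outside)  # the simplified one accepts it

# On input where every Get is inside a block, both variants agree.
inside = "Start_Get_Get_Stop"
full_inside = full.findall(inside)
simplified_inside = simplified.findall(inside)

print(full_outside, simplified_outside, full_inside, simplified_inside)
```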
Alan Moore
I understand. Thanks again!
Jerome