How can I use a regular expression to match text that is between two strings, where those two strings are themselves enclosed by two other strings, with any amount of text between the inner and outer enclosing strings?

For example, I have this text:

outer-start some text inner-start text-that-i-want inner-end some more text outer-end

In this case, I want text-that-i-want because it is between inner-start and inner-end, which themselves are between outer-start and outer-end.

If I have

some text inner-start text-that-i-want inner-end some more text outer-end

then I don't want text-that-i-want, because although it is between inner-start and inner-end, there is no outer-start enclosing these strings.

Likewise, if I have

outer-start some text text-that-i-want inner-end some more text outer-end

then again, I don't want text-that-i-want, because there is no enclosing inner-start, although there are enclosing outer-start and outer-end strings.

Assume that outer-start, inner-start, inner-end and outer-end will only ever be used for the purposes of enclosing/delimiting.

I reckon that I can do this with a two-pass regular expression match, i.e. first looking for any data between outer-start and outer-end, and then looking within that data for any text between inner-start and inner-end (if indeed those strings exist), but I would like to know if it can be done in one go.
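For reference, the two-pass idea described above might look like the following sketch. It assumes Python's re module (the question does not name a language), and the function name extract_two_pass is made up for illustration.

```python
import re

# Pass 1: everything between outer-start and outer-end.
OUTER = re.compile(r"outer-start(.*?)outer-end", re.DOTALL)
# Pass 2: everything between inner-start and inner-end.
INNER = re.compile(r"inner-start(.*?)inner-end", re.DOTALL)


def extract_two_pass(text):
    """Return each inner payload that lies inside an outer block."""
    results = []
    for outer in OUTER.finditer(text):
        # Only search for the inner delimiters within the outer block's body.
        for inner in INNER.finditer(outer.group(1)):
            results.append(inner.group(1).strip())
    return results


good = ("outer-start some text inner-start text-that-i-want inner-end "
        "some more text outer-end")
no_outer_start = ("some text inner-start text-that-i-want inner-end "
                  "some more text outer-end")
no_inner_start = ("outer-start some text text-that-i-want inner-end "
                  "some more text outer-end")

print(extract_two_pass(good))
print(extract_two_pass(no_outer_start))
print(extract_two_pass(no_inner_start))
```

Only the first of the three sample strings should produce a result; the other two are missing a delimiter and yield an empty list.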

+4  A: 

I imagine you can do something like:


outer-start .*? inner-start (.*?) inner-end .*? outer-end
Ben McCann
Looks like Brian beat me to posting this solution. The reason I included question marks was to save you from trouble with a greedy regex. You'll likely want to include them.
Ben McCann
+5  A: 
/outer-start.*?inner-start(.*?)inner-end.*?outer-end/

You need to use minimal (lazy) matching to keep the regexp engine from matching too much when there are multiple "texts-that-i-want"s, for example:

"outer-start some text inner-start first-text-that-i-want inner-end some more text outer-end outer-start some text inner-start second-text-that-i-want inner-end some more text outer-end"

Without minimal matching, you'll get the puzzling single match, "second-text-that-i-want".

The .*? means "eat zero or more characters, but only as many as you need to make the rest of the expression match." Without the ?, a regexp engine will eat as many characters as it can as long as the rest of the expression still matches.
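To see the difference concretely, here is a small sketch that runs both variants over the two-block example above. It assumes Python's re module; the variable names are only illustrative.

```python
import re

text = ("outer-start some text inner-start first-text-that-i-want inner-end "
        "some more text outer-end outer-start some text inner-start "
        "second-text-that-i-want inner-end some more text outer-end")

# Lazy quantifiers (.*?) stop at the nearest delimiter that lets the match succeed.
lazy = re.compile(r"outer-start.*?inner-start(.*?)inner-end.*?outer-end")
# Greedy quantifiers (.*) run to the last delimiter that lets the match succeed.
greedy = re.compile(r"outer-start.*inner-start(.*)inner-end.*outer-end")

lazy_hits = [hit.strip() for hit in lazy.findall(text)]
greedy_hits = [hit.strip() for hit in greedy.findall(text)]

print(lazy_hits)    # one entry per outer block
print(greedy_hits)  # a single entry, taken from the last block
```

The greedy version matches the whole string once: its first .* swallows everything up to the last inner-start, so only the second text lands in the capture group.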

Wayne Conrad
As a matter of fact, with greedy matching you'd get "first-text-that-i-want inner-end some more text outer-end outer-start some text inner-start second-text-that-i-want" in the capture group.
Michał Marczyk
Michal: Nope, the first (and non-grouped) `.*` eats most of the text you quoted.
Roger Pate
Ouch... right. My bad, thanks for the correction. In fact, this is good reason to remove my answer and +1 this one.
Michał Marczyk
@Wayne: Why don't you edit to include the lazy version (.*?) in the pattern at the top? As your answer stands, you've got a good explanation of why .*? is to be preferred over .*, yet use .* in the high visibility example. :-)
Michał Marczyk
@Michał: Oh, that was careless of me. I tested both the good and bad regex, but when I posted the answer, I copied and pasted the bad one. Bad programmer, no cookie! Thanks for watching my back.
Wayne Conrad
Sure, at least I can give myself a tiny cookie for that. You still get the scrumptious one for sound lazy-matching-fu. ;-)
Michał Marczyk
Wayne, thanks a lot for the answer. I actually have a follow-up question: is there a way I can check that [the text between outer-start and inner-start] does not contain a specific string? That is, I only want to return a match if this specific string is not found between outer-start and inner-start. I can open a new question if you think it's complicated...
Shoko
@Shoko: In general, just replace the `.*?` between outer-start and inner-start with a pattern which will exclude the specific string that you wish to exclude. That might be tricky or not, depending on the string in question. If in doubt, do ask a separate question.
Michał Marczyk
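One common way to do that replacement is a negative lookahead checked before every character consumed between outer-start and inner-start. The sketch below assumes Python's re module and a regex flavour with lookahead support; the helper name build_pattern and the excluded word "forbidden" are placeholders, not anything from the thread.

```python
import re


def build_pattern(excluded):
    """Build the lazy pattern, refusing `excluded` before inner-start."""
    # (?:(?!X).)*? consumes one character at a time, but only at positions
    # where the excluded string X does not begin.
    guard = r"(?:(?!" + re.escape(excluded) + r").)*?"
    return re.compile(
        r"outer-start" + guard + r"inner-start(.*?)inner-end.*?outer-end",
        re.DOTALL,
    )


pattern = build_pattern("forbidden")  # "forbidden" is a made-up example string

allowed = ("outer-start some text inner-start text-that-i-want inner-end "
           "some more text outer-end")
blocked = ("outer-start some forbidden text inner-start text-that-i-want "
           "inner-end some more text outer-end")

allowed_match = pattern.search(allowed)
blocked_match = pattern.search(blocked)

print(allowed_match.group(1).strip() if allowed_match else None)
print(blocked_match)
```

The guard only covers the stretch between outer-start and inner-start; the excluded string may still appear anywhere after inner-start without preventing a match.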