I'm stuck on a RegEx problem that's seemingly very simple and yet I can't get it working.

Suppose I have input like this:

Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%
Some text %interestingbit% lots of random text OPTIONAL_THING lots and lots more %anotherinterestingbit%
Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%

There are many repeating blocks in the input. In each block I want to capture some things that are always there (%interestingbit% and %anotherinterestingbit%), but there is also a bit of text that may or may not occur between them (OPTIONAL_THING), and I want to capture it if it's there.

A RegEx like this matches only the blocks with OPTIONAL_THING in them (and the named capture works):

%interestingbit%.+?((?<OptionalCapture>OPTIONAL_THING)).+?%anotherinterestingbit%

So it seems like it's just a matter of making the whole group optional, right? That's what I tried:

%interestingbit%.+?((?<OptionalCapture>OPTIONAL_THING))?.+?%anotherinterestingbit%

But I find that although this matches all 3 blocks, the named capture (OptionalCapture) is empty in all of them! How do I get this to work?

Note that there can be a lot of text within each block, including newlines, which is why I put in ".+?" rather than something more specific. I'm using .NET regular expressions, testing with The Regulator.

A: 

Why do you have the extra set of parentheses?

Try this:

%interestingbit%.+?(?<OptionalCapture>OPTIONAL_THING)?.+?%anotherinterestingbit%

Or maybe this will work:

%interestingbit%.+?(?<OptionalCapture>OPTIONAL_THING|).+?%anotherinterestingbit%

In this example, the group captures OPTIONAL_THING, or nothing.

strager
Nope, sorry, neither of these works. They behave the same as my regex with the optional group: all 3 blocks match, but OPTIONAL_THING is not captured.
Evgeny
@Evgeny, are you sure .+? is making the wildcard "ungreedy"? Perhaps you can try .*? instead.
strager
@strager, tried that, doesn't make a difference
Evgeny
@Evgeny, do any of the regexes work as expected when you turn the named group into a non-named/numbered group? Another option is doing something like /(currently working regex here|regex without OPTIONAL_THING here)/.
strager
@strager, no, whether it's named or not makes no difference. The big | doesn't work either, because it produces 2 matches for the above input with the first match being from the start of the first block to the end of the second one.
Evgeny
To me the problem seems to be the first non-greedy pattern. You're in effect matching up to OPTIONAL_THING or nothing, so the first .+? stops after a single character, where the optional group is allowed to match "nothing". Because OPTIONAL_THING doesn't come right after, the second .+? then matches the rest of the block, up to %anotherinterestingbit%. Right..?
Niko Nyman
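The behaviour described in the comment above can be reproduced with a small sketch. It is a Python port, not the asker's .NET setup: Python's `re` writes named groups as `(?P<name>...)`, and `re.DOTALL` is assumed here as the counterpart of .NET's `RegexOptions.Singleline`, so that `.` can cross newlines. The variable names are illustrative only.

```python
import re

# Sample input from the question: three blocks, only the second has OPTIONAL_THING.
text = (
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text OPTIONAL_THING lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
)

# The asker's pattern with the group made optional (Python named-group syntax).
naive = re.compile(
    r"%interestingbit%.+?(?P<OptionalCapture>OPTIONAL_THING)?.+?%anotherinterestingbit%",
    re.DOTALL,  # assumption: "." should also match newlines
)

# The optional group is tried one character after %interestingbit%, fails to
# find OPTIONAL_THING there, and is skipped; the second .+? then consumes
# everything up to %anotherinterestingbit%, OPTIONAL_THING included.
naive_captures = [m.group("OptionalCapture") for m in naive.finditer(text)]
print(naive_captures)
```

All three blocks match, but the named group never participates in a match, which is consistent with the empty captures the asker reports.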
A: 

Try this:

%interestingbit%(?:(.+)(?<optionalCapture>OPTIONAL_THING))?(.+?)%anotherinterestingbit%

First there's a non-capturing group which matches .+OPTIONAL_THING or nothing. If a match is found, there's the named group inside, which captures OPTIONAL_THING for you. The rest is captured with .+?%anotherinterestingbit%.

[edit]: I added a couple of parentheses for additional capture groups, so now the captured groups match the following:

  • $1 : text before OPTIONAL_THING or nothing
  • ${optionalCapture} : OPTIONAL_THING or nothing (.NET numbers named groups after the unnamed ones, so this is $3 rather than $2)
  • $2 : text after OPTIONAL_THING, or if OPTIONAL_THING is not found, the full text between %interestingbit% and %anotherinterestingbit%

Are these the three matches you're looking for?

Niko Nyman
Sorry, but this has the same issue as using one big "|" - the first match includes two blocks, so there are only 2 matches in total, not 3.
Evgeny
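A rough way to see why this pattern yields only two matches is to run a Python port of it over the sample input. The sketch assumes `re.DOTALL` (so `.` crosses newlines, as with .NET's `RegexOptions.Singleline`) and uses Python's `(?P<name>...)` spelling. Groups are read by name only, because Python and .NET number named groups differently.

```python
import re

# Sample input from the question: three blocks, only the second has OPTIONAL_THING.
text = (
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text OPTIONAL_THING lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
)

# Port of the pattern with the greedy (.+) inside the optional group.
greedy = re.compile(
    r"%interestingbit%(?:(.+)(?P<optionalCapture>OPTIONAL_THING))?(.+?)%anotherinterestingbit%",
    re.DOTALL,  # assumption: "." should also match newlines
)

greedy_matches = list(greedy.finditer(text))

for m in greedy_matches:
    # Count how many block terminators each match swallowed.
    blocks_spanned = m.group(0).count("%anotherinterestingbit%")
    print(blocks_spanned, repr(m.group("optionalCapture")))
```

Starting from the first block, the greedy `(.+)` runs to the end of the input and backtracks only as far as the OPTIONAL_THING in the second block, so the first match covers two blocks.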
Oooops.. edited my answer before noticing there was a new answer ABOVE my answer. Another thing learned about Stack Overflow -- the answers are not in chronological order...
Niko Nyman
+2  A: 

My thoughts are along similar lines to Niko's idea. However, I would suggest placing the second .+? inside the optional group instead of the first, as follows:

%interestingbit%.+?(?:(?<optionalCapture>OPTIONAL_THING).+?)?%anotherinterestingbit%

This avoids unnecessary backtracking. If the first .+? is inside the optional group and OPTIONAL_THING does not exist in the search string, the regex engine won't know this until it gets to the end of the string. It will then need to backtrack, perhaps quite a bit, to match %anotherinterestingbit%, which as you said will always exist.

Also, since OPTIONAL_THING, when it exists, will always come before %anotherinterestingbit%, the text after it is effectively optional as well and fits more naturally into the optional group.
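
As a quick check of the pattern above, here is a minimal Python sketch run against the sample input from the question. It is a port rather than .NET code: the named group uses Python's `(?P<name>...)` syntax, and `re.DOTALL` is assumed in place of `RegexOptions.Singleline` so that `.` can match newlines. Variable names are illustrative.

```python
import re

# Sample input from the question: three blocks, only the second has OPTIONAL_THING.
text = (
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text OPTIONAL_THING lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
)

# The second .+? lives inside the optional group, so at each position the
# engine tries "OPTIONAL_THING ... terminator" first and the bare terminator second.
fixed = re.compile(
    r"%interestingbit%.+?(?:(?P<optionalCapture>OPTIONAL_THING).+?)?%anotherinterestingbit%",
    re.DOTALL,  # assumption: "." should also match newlines
)

fixed_matches = list(fixed.finditer(text))
fixed_captures = [m.group("optionalCapture") for m in fixed_matches]

for m in fixed_matches:
    # Each match should stay within a single block.
    print(m.group(0).count("%anotherinterestingbit%"), repr(m.group("optionalCapture")))
```

Because the first `.+?` is lazy, it stops at the earliest position where either OPTIONAL_THING or %anotherinterestingbit% follows, so each block is matched on its own and the capture is set only where OPTIONAL_THING occurs.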

Bryan
Ta-ta-da-da! It works! Thanks very much.
Evgeny