Hi,

I can't seem to figure out captures and groups in .NET regular expressions.

Let's say I have the following input string, where each letter is actually a placeholder for a more complex regular expression (so simple character exclusion won't work):

CBDAEDBCEFBCD

Or, more generically, here is a string pattern written in 'regex':

(C|B|D)*A(E*)(D|B|C)*(E*)F(B|C|D)*

There will only be one A and one F. I need to capture, as individual captures (or matches or groups), all instances of B, C, D (which in my app are more complex groups) that occur after A and before F. I also need A and F. I don't need E, and I don't need the C, B, D before the A or the B, C, D after the F.

I would expect the correct result to be:

Groups["start"] (1 capture) = A
Groups["content"] (3 captures)  
  Captures[0] = D  
  Captures[1] = B
  Captures[2] = C
Groups["end"] (1 capture) = F

I tried a few feeble attempts but none of them worked.

This one captures only the last C before EF in the sample string above (start = A and end = F come out correctly):

(?<=(?<start>A)).+(?<content>B|C|D).+(?=(?<end>F))

Same result as above (I just added a + after (?<content>B|C|D)):

(?<=(?<start>A)).+(?<content>B|C|D)+.+(?=(?<end>F))

Then I got rid of the look-arounds, with the same result as above:

(?<start>A).+(?<content>B|C|D)+.+(?<end>F)

And then my good-for-nothing brain went on strike.

So, what's the right way to approach this? Are look-arounds really needed for this or not?

Thanks!

A: 

Since you said all instances of C, B, D, I would think you'd want to use a character class for that: [CBD]*. Also, if you're just looking for something that comes after the letter A but before F, then you should be able to use those literals along with some exclusions.

Here's a pattern I came up with. Group $4 should contain the letters DBC:

([^A]*)(A)([^CBDF]*)([CBD]*)([^F]*)(F)(.*)

Here's an example of this pattern in action.

The question is, what do you want if the original string is CBDAEDEBECEFBCD?
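
To make that question concrete, here is a minimal sketch that runs the pattern above against both the sample string and the interleaved one. It is written as a C# top-level program (which assumes C# 9 or later), and the group4 helper is a made-up name for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from this answer; group 4 is the [CBD]* run right after A.
var pattern = new Regex(@"([^A]*)(A)([^CBDF]*)([CBD]*)([^F]*)(F)(.*)");

// Hypothetical helper: returns the text of group 4 for a given input.
Func<string, string> group4 = input => pattern.Match(input).Groups[4].Value;

string sample = group4("CBDAEDBCEFBCD");
string interleaved = group4("CBDAEDEBECEFBCD");

Console.WriteLine(sample);       // DBC
Console.WriteLine(interleaved);  // D
```

With the interleaved input, group 4 stops at the first E, so only the first D lands in it; the remaining letters fall into group 5.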

Snekse
Sorry, all the letters are placeholders for more complex groups (I'll update the question), so I can't just use literal exclusions. The string CBDAEDEBECEFBCD you suggest shouldn't match at all: the pattern only allows a run of E's between A and the first (B|C|D), and a run of E's immediately before the F. Again, in my app they're not just E's, they're just text that I don't need.
Jimmy
If that's the case, then look-arounds are probably your only option.
Snekse
Are you able to suggest a lookaround that works? Even with lookarounds I still can't get it to work.
Jimmy
+1  A: 

Yeah, forget the lookarounds, they just complicate things needlessly. But I suspect your final regex will work if you make that first .+ reluctant:

(?<start>A).+?(?<content>B|C|D)+.+(?<end>F)
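
A rough way to see why the quantifier matters, assuming the sample string from the question: the sketch below (a C# top-level program, C# 9 or later) counts how many captures the content group keeps with a greedy first .+ versus a reluctant .+?. The variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "CBDAEDBCEFBCD";

// Greedy first .+ : it grabs as much as it can and gives back only what is
// needed, which leaves a single letter for the content group.
Match greedy = Regex.Match(input, @"(?<start>A).+(?<content>B|C|D)+.+(?<end>F)");

// Reluctant first .+? : it stops as early as possible, so the content group
// gets to repeat over the whole D, B, C run.
Match lazy = Regex.Match(input, @"(?<start>A).+?(?<content>B|C|D)+.+(?<end>F)");

int greedyCount = greedy.Groups["content"].Captures.Count;
int lazyCount = lazy.Groups["content"].Captures.Count;

Console.WriteLine(greedyCount); // 1
Console.WriteLine(lazyCount);   // 3
```

In the greedy case the one surviving capture is the C just before EF, which is the result described in the question.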

EDIT: yep:

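using System;
using System.Text.RegularExpressions;

string s = "CBDAEDBCEFBCD";
// The reluctant .+? stops at the first B, C, or D, so the repeated group sees each one.
Regex r = new Regex(@"(?<start>A).+?(?<content>B|C|D)+.+(?<end>F)");

foreach (Match m in r.Matches(s))
{
  Console.WriteLine(@"Groups[""start""] = {0}", m.Groups["start"]);
  // Captures holds every substring the group matched, not just the last one.
  foreach (Capture c in m.Groups["content"].Captures)
  {
    Console.WriteLine(@"Capture[""content""] = {0}", c.Value);
  }
  Console.WriteLine(@"Groups[""end""] = {0}", m.Groups["end"]);
}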

output:

Groups["start"] = A
Capture["content"] = D
Capture["content"] = B
Capture["content"] = C
Groups["end"] = F
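
One caveat: .+? and .+ accept any text, so an input such as CBDAEDEBECEFBCD (which, per the comments on the other answer, should not match) still matches this pattern. A sketch of a stricter variant follows, assuming the unwanted filler can be written as its own sub-pattern; a plain E stands in for it here, as in the question, and the variable names are illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// E stands in for the filler sub-pattern; this is an assumption for illustration.
var strict = new Regex(@"(?<start>A)E*(?<content>B|C|D)+E*(?<end>F)");

Match good = strict.Match("CBDAEDBCEFBCD");
Match bad = strict.Match("CBDAEDEBECEFBCD");

// Join the individual captures of the content group for display.
string captures = string.Join(",",
    good.Groups["content"].Captures.Cast<Capture>().Select(c => c.Value));

Console.WriteLine(captures);     // D,B,C
Console.WriteLine(bad.Success);  // False
```

Spelling out the filler on both sides of the content group means the B, C, D run has to be contiguous, at the cost of having to describe the filler explicitly.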
Alan Moore
Really! So lookarounds are not needed huh... amazing.
Jimmy
@Jimmy: In this case, yes, lookarounds are not needed. But they do have their uses. :P
Alan Moore