Hi, guys! I am making a small application using the .NET Regex types, and the Capture, Group and Match types have me totally confused. I have never seen such an ugly solution. Could someone explain their usage to me? Many thanks.
A Match is the result of one successful match of the entire regex against the input. Groups and Captures both relate to capturing groups (each parenthesized (expression) within the regex), but they behave differently. Here's a quote from the MSDN article on the Capture class that explains the difference:
If you do not apply a quantifier to a capturing group, the Group.Captures property returns a CaptureCollection with a single Capture object that provides information about the same capture as the Group object. If you do apply a quantifier to a capturing group, the Group.Index, Group.Length, and Group.Value properties provide information only about the last captured group, whereas the Capture objects in the CaptureCollection provide information about all subexpression captures. The example provides an illustration.
(Source)
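To make the quoted distinction concrete, here is a minimal sketch. The pattern, the input "abc" and the class name QuantifierDemo are made up for illustration and are not from the MSDN article:

```csharp
using System;
using System.Text.RegularExpressions;

class QuantifierDemo
{
    static void Main()
    {
        // No quantifier on the group: exactly one Capture, identical to the Group.
        Match single = Regex.Match("abc", @"^([a-z]+)$");
        Group g1 = single.Groups[1];
        Console.WriteLine("{0} capture(s); group={1}, capture={2}",
            g1.Captures.Count, g1.Value, g1.Captures[0].Value);

        // Quantifier on the group: one Capture per iteration,
        // while the Group itself reports only the last iteration.
        Match repeated = Regex.Match("abc", @"^([a-z])+$");
        Group g2 = repeated.Groups[1];
        Console.WriteLine("{0} capture(s); group={1}", g2.Captures.Count, g2.Value);
        foreach (Capture c in g2.Captures)
            Console.WriteLine("  index {0}: {1}", c.Index, c.Value);
    }
}
```

The first line printed should report 1 capture with group and capture both equal to abc; the second should report 3 captures with the group value c, followed by a, b and c at indexes 0, 1 and 2.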
Here's a simpler example than the one in the document @Dav cited:
    string s0 = @"foo%123%456%789";
    // Group 1 is the leading word; group 2 sits inside a repeated non-capturing group.
    Regex r0 = new Regex(@"^([a-z]+)(?:%([0-9]+))+$");
    Match m0 = r0.Match(s0);
    if (m0.Success)
    {
        Console.WriteLine(@"full match: {0}", m0.Value);
        Console.WriteLine(@"group #1: {0}", m0.Groups[1].Value);
        // Group.Value reports only the last thing the group captured.
        Console.WriteLine(@"group #2: {0}", m0.Groups[2].Value);
        // Group.Captures keeps every capture, in match order.
        Console.WriteLine(@"group #2 captures: {0}, {1}, {2}",
            m0.Groups[2].Captures[0].Value,
            m0.Groups[2].Captures[1].Value,
            m0.Groups[2].Captures[2].Value);
    }
Result:
    full match: foo%123%456%789
    group #1: foo
    group #2: 789
    group #2 captures: 123, 456, 789
The full match and group #1 results are straightforward, but the others require some explanation. Group #2, as you can see, is inside a non-capturing group that's controlled by a + quantifier. It matches three times, but if you request its Value, you only get what it matched the third time around: the final capture. Similarly, if you use the $2 placeholder in a replacement string, the final capture is what gets inserted in its place.
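A quick way to check the replacement-string behavior is the sketch below. It reuses the input and pattern from the example above; the class name ReplaceDemo and the replacement text "$1 -> $2" are just illustrative choices:

```csharp
using System;
using System.Text.RegularExpressions;

class ReplaceDemo
{
    static void Main()
    {
        string input = "foo%123%456%789";
        Regex regex = new Regex(@"^([a-z]+)(?:%([0-9]+))+$");

        // $1 and $2 expand to each group's Value, i.e. its final capture.
        string result = regex.Replace(input, "$1 -> $2");
        Console.WriteLine(result);
    }
}
```

This should print foo -> 789; the 123 and 456 captures never show up in the output.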
In most regex flavors, that's all you can get: each intermediate capture is overwritten by the next and lost. .NET is almost unique in preserving all of the captures and making them available after the match is performed. You can access them directly as I did here, or iterate through the CaptureCollection as you would a MatchCollection. There's no equivalent of the $1-style replacement-string placeholders for the individual captures, though.
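Here's a rough sketch of both points: looping over the CaptureCollection, and working around the missing placeholder by passing a MatchEvaluator delegate to Regex.Replace, which has access to the whole Match. The class name CaptureLoopDemo and the output format are my own choices for illustration:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class CaptureLoopDemo
{
    static void Main()
    {
        string input = "foo%123%456%789";
        Regex regex = new Regex(@"^([a-z]+)(?:%([0-9]+))+$");

        // Iterate the CaptureCollection instead of indexing it by hand.
        Match match = regex.Match(input);
        foreach (Capture capture in match.Groups[2].Captures)
            Console.WriteLine("{0} at index {1}", capture.Value, capture.Index);

        // A MatchEvaluator can reach every capture during a replacement.
        // Cast<Capture>() keeps this working where CaptureCollection
        // is only a non-generic collection.
        string result = regex.Replace(input, m =>
            m.Groups[1].Value + ": " +
            string.Join(", ", m.Groups[2].Captures.Cast<Capture>().Select(c => c.Value)));
        Console.WriteLine(result);
    }
}
```

The loop should print 123, 456 and 789 at indexes 4, 8 and 12, and the replacement should come out as foo: 123, 456, 789.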
So the reason the API design is so ugly (as you put it) is twofold: first, it was adapted from Perl's integral regex support to .NET's object-oriented framework; then the CaptureCollection structure was grafted onto it. Perl 6 (since renamed Raku) offers a much cleaner solution, but the authors accomplished that by rewriting Perl practically from scratch and throwing backward compatibility out the window.