Recently, I found one C# Regex API really annoying.

I have the regular expression "(([0-9]+)|([a-z]+))+" and I want to find every matched string. The code is below.

        string regularExp = "(([0-9]+)|([a-z]+))+";
        string str = "abc123xyz456defFOO";

        Match match = Regex.Match(str, regularExp, RegexOptions.None);
        int matchCount = 0;

        while (match.Success)
        {
            Console.WriteLine("Match" + (++matchCount));

            Console.WriteLine("Match group count = {0}", match.Groups.Count);
            for (int i = 0; i < match.Groups.Count; i++)
            {
                Group group = match.Groups[i];
                Console.WriteLine("Group" + i + "='" + group.Value + "'");
            }

            match = match.NextMatch();
            Console.WriteLine("go to next match");
            Console.WriteLine();
        }

The output is:

Match1
Match group count = 4
Group0='abc123xyz456def'
Group1='def'
Group2='456'
Group3='def'
go to next match

It seems that each group.Value holds only the last string that group matched ("def" and "456"). It took me some time to figure out that I should rely on group.Captures instead of group.Value.

        string regularExp = "(([0-9]+)|([a-z]+))+";
        string str = "abc123xyz456def";
        //Console.WriteLine(str);

        Match match = Regex.Match(str, regularExp, RegexOptions.None);
        int matchCount = 0;

        while (match.Success)
        {
            Console.WriteLine("Match" + (++matchCount));

            Console.WriteLine("Match group count = {0}", match.Groups.Count);
            for (int i = 0; i < match.Groups.Count; i++)
            {
                Group group = match.Groups[i];
                Console.WriteLine("Group" + i + "='" + group.Value + "'");

                CaptureCollection cc = group.Captures;
                for (int j = 0; j < cc.Count; j++)
                {
                    Capture c = cc[j];
                    Console.WriteLine("    Capture" + j + "='" + c + "', Position=" + c.Index);
                }
            }

            match = match.NextMatch();
            Console.WriteLine("go to next match");
            Console.WriteLine();
        }

This will output:

Match1
Match group count = 4
Group0='abc123xyz456def'
    Capture0='abc123xyz456def', Position=0
Group1='def'
    Capture0='abc', Position=0
    Capture1='123', Position=3
    Capture2='xyz', Position=6
    Capture3='456', Position=9
    Capture4='def', Position=12
Group2='456'
    Capture0='123', Position=3
    Capture1='456', Position=9
Group3='def'
    Capture0='abc', Position=0
    Capture1='xyz', Position=6
    Capture2='def', Position=12
go to next match

Now I am wondering why the API is designed like this. Why does Group.Value return only the last matched string? This design doesn't look good.

+2  A: 

The primary reason is historical: regexes have always worked that way, going back to Perl and beyond. But it's not really bad design. Usually, if you want every match like that, you just leave off the outermost quantifier (+ in this case) and use the Matches() method instead of Match(). Every regex-enabled language provides a way to do that: in Perl or JavaScript you do the match in /g mode; in Ruby you use the scan method; in Java you call find() repeatedly until it returns false. Similarly, if you're doing a replace operation, you can plug the captured substrings back in as you go with placeholders ($1, $2 or \1, \2, depending on the language).
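As a rough sketch of the Matches() approach applied to the question's input, the outer + is dropped so that each run of digits or lowercase letters becomes its own match. The TokenDemo class and Tokens helper are names made up for this illustration, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenDemo
{
    // Hypothetical helper: returns every digit run or lowercase-letter run
    // in the input as a separate string.
    public static List<string> Tokens(string input)
    {
        // Same alternation as in the question, minus the outermost "+".
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(input, "([0-9]+)|([a-z]+)"))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }

    public static void Main()
    {
        // Prints abc, 123, xyz, 456, def, one per line.
        foreach (string token in Tokens("abc123xyz456def"))
        {
            Console.WriteLine(token);
        }
    }
}
```

Within each match, Groups[1] succeeds for a digit run and Groups[2] for a letter run, so Group.Success can tell the two kinds apart without touching Captures.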

On the other hand, I know of no other Perl 5-derived regex flavor that provides the ability to retrieve intermediate capture-group matches like .NET does with its CaptureCollections. And I'm not surprised: it's actually very seldom that you really need to capture all the matches in one go like that. And think of all the storage and/or processing power it can take to keep track of all those intermediate matches. It is a nice feature though.

Alan Moore