The following code

string expression = "(\\{[0-9]+\\})";
RegexOptions options = ((RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline) | RegexOptions.IgnoreCase);
Regex tokenParser = new Regex(expression, options);

MatchCollection matches = tokenParser.Matches("The {0} is a {1} and the {2} is also a {1}");

will match and capture "{0}", "{1}", "{2}" and "{1}".

Is it possible to change it (either the regular expression or the options passed to the Regex) so that it matches and captures only "{0}", "{1}" and "{2}"? In other words, each distinct token should be captured only once.

A: 

If you only want one instance, change

string expression = "(\\{[0-9]+\\})"; // one or more repetitions

to

string expression = "(\\{[0-9]{1}})";  // exactly 1 repetition
mcauthorn
Not going to work. Tokens {10}, {11}, etc. will no longer match and multiple instances of {0}, {1} to {9} will still be captured if they exist.
Steve Crane
Also, if you only want to match a single digit, the {1} count specifier is redundant.
Steve Crane
+1  A: 

Regular expressions solve lots of problems, but not every problem. How about using other tools in the toolbox?

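var parameters = new HashSet<string>(
    matches.Cast<Match>().Select(mm => mm.Value));

Or

var parameters = matches.Cast<Match>().Select(mm => mm.Value).Distinct();
sixlettervariables
Meta comment: index 0 is the entire match only within a single Match's Groups collection. Each item in the MatchCollection is a separate token, so nothing needs to be skipped.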
sixlettervariables
I was thinking of something like this to make the matches unique after the regex does its work. I just wondered whether the regex itself might have some magic to do this without additional code. See my answer for the solution I came up with.
Steve Crane
Sometimes you can finagle what you want out of Regex, but often at the cost of readability or performance. I tend to take the easy route and see if I need more out of it :-D
sixlettervariables
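For reference, the de-duplication approach discussed above can be sketched as a small self-contained program. The class and method names (DistinctTokensDemo, DistinctTokens) are hypothetical and not part of the original answer. The Cast<Match>() call is there on the assumption that the code may run on an older framework where MatchCollection is not generic.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class DistinctTokensDemo
{
    // Hypothetical helper: collects every numbered token and drops repeats.
    public static List<string> DistinctTokens(string input)
    {
        return Regex.Matches(input, @"\{[0-9]+\}")
                    .Cast<Match>()            // needed where MatchCollection is non-generic
                    .Select(m => m.Value)     // each Match is one token
                    .Distinct()               // keep one copy of each token
                    .ToList();
    }

    public static void Main()
    {
        var tokens = DistinctTokens("The {0} is a {1} and the {2} is also a {1}");
        Console.WriteLine(string.Join(", ", tokens)); // {0}, {1}, {2}
    }
}
```

The regex stays simple and the uniqueness rule lives in ordinary LINQ code, which is the trade-off the comments above are describing.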
A: 

Here is what I came up with.

private static bool TokensMatch(string t1, string t2)
{
  return TokenString(t1) == TokenString(t2);
}

private static string TokenString(string input)
{
  Regex tokenParser = new Regex(@"(\{[0-9]+\})|(\[.*?\])");

  string[] tokens = tokenParser.Matches(input).Cast<Match>()
      .Select(m => m.Value).Distinct().OrderBy(s => s).ToArray<string>();

  return String.Join(String.Empty, tokens);
}

Note that the regular expression differs from the one in my question because I cater for two types of token: numbered ones delimited by {} and named ones delimited by [].

Steve Crane
RegexOptions.Compiled may help along with moving that Regex out of the method and making it static.
sixlettervariables
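The suggestion about RegexOptions.Compiled and a static Regex might look roughly like the sketch below. The class name TokenComparer and the field name TokenParser are assumptions made for illustration, and the methods are made public only so the example can be called from Main. Whether the compiled option pays off depends on how often the method is called.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenComparer
{
    // Built once and reused by every call, instead of being created per call.
    private static readonly Regex TokenParser =
        new Regex(@"(\{[0-9]+\})|(\[.*?\])", RegexOptions.Compiled);

    public static bool TokensMatch(string t1, string t2)
    {
        return TokenString(t1) == TokenString(t2);
    }

    public static string TokenString(string input)
    {
        // Distinct tokens, sorted so that token order in the input does not matter.
        string[] tokens = TokenParser.Matches(input).Cast<Match>()
            .Select(m => m.Value).Distinct().OrderBy(s => s).ToArray();

        return String.Join(String.Empty, tokens);
    }

    public static void Main()
    {
        Console.WriteLine(TokensMatch("The {0} is a {1}", "A {1} is the {0} and {1}")); // True
        Console.WriteLine(TokensMatch("{0} [name]", "{0}"));                            // False
    }
}
```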
+1  A: 

Here's something you could use for a pure regex solution:

Regex r = new Regex(@"(\{[0-9]+\}|\[[^\[\]]+\])(?<!\1.*\1)",
                    RegexOptions.Singleline);

But for the sake of both efficiency and maintainability, you're probably better off with a mixed solution like the one you posted.

Alan Moore
Thanks Alan. I will stay with my current solution but it's good to expand my knowledge of regular expressions.
Steve Crane
Doing the distinct checking outside the regex is faster too. I tested this by switching to the pure regex expression and removing the Distinct() call: it returns the same result but takes almost twice as long. A good reminder that overusing regular expressions, or any tool, is not always the best solution.
Steve Crane
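To see what the pure regex version does, here is a minimal sketch that runs the lookbehind expression against the sentence from the question. The class and method names (FirstOccurrenceDemo, FirstOccurrences) are hypothetical. The negative lookbehind rejects a token when the same text already appears earlier in the input, so only first occurrences are matched. This relies on .NET allowing variable-length lookbehind with backreferences, which many other regex flavors do not.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class FirstOccurrenceDemo
{
    // (?<!\1.*\1) fails the match if the captured token also occurs earlier in the input.
    private static readonly Regex FirstOnly = new Regex(
        @"(\{[0-9]+\}|\[[^\[\]]+\])(?<!\1.*\1)",
        RegexOptions.Singleline);

    // Hypothetical helper: returns the tokens the regex matches, with no Distinct() call.
    public static List<string> FirstOccurrences(string input)
    {
        return FirstOnly.Matches(input).Cast<Match>()
                        .Select(m => m.Value)
                        .ToList();
    }

    public static void Main()
    {
        var tokens = FirstOccurrences("The {0} is a {1} and the {2} is also a {1}");
        Console.WriteLine(string.Join(", ", tokens)); // {0}, {1}, {2}
    }
}
```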