In .NET, the regex engine does not organize captures the way I would expect. (I won't call this a bug, because obviously someone intended it. However, it's not how I'd expect it to work, nor do I find it helpful.)

This regex is for recipe ingredients (simplified for the sake of example):

(?<measurement>           # begin group
  \s*                     # optional beginning space or group separator
  (
    (?<integer>\d+)|      # integer
    (
      (?<numtor>\d+)      # numerator
      /
      (?<dentor>[1-9]\d*) # denominator. 0 not allowed
    )
  )
  \s(?<unit>[a-zA-Z]+)
)+                        # end group. can have multiple

My string: 3 tbsp 1/2 tsp

Resulting groups and captures:

[measurement][0]=3 tbsp
[measurement][1]= 1/2 tsp
[integer][0]=3
[numtor][0]=1
[dentor][0]=2
[unit][0]=tbsp
[unit][1]=tsp

Notice how, even though 1/2 tsp is in the 2nd Capture of [measurement], its parts are at index [0] of [numtor] and [dentor], since those slots were previously unused.
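
For reference, a listing like the one above can be produced by walking each named group's Captures collection. This is only a sketch: the class name CaptureDump and the method Dump are hypothetical names chosen for illustration, and the pattern is the simplified one shown earlier.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureDump
{
    // The simplified ingredient pattern from the question.
    private static readonly Regex R = new Regex(@"
        (?<measurement>
          \s*
          (
            (?<integer>\d+)|
            (
              (?<numtor>\d+)
              /
              (?<dentor>[1-9]\d*)
            )
          )
          \s(?<unit>[a-zA-Z]+)
        )+", RegexOptions.IgnorePatternWhitespace);

    // Returns one line per capture, in the form [group][index]=value.
    public static List<string> Dump(string input)
    {
        var lines = new List<string>();
        Match m = R.Match(input);
        foreach (string name in new[] { "measurement", "integer", "numtor", "dentor", "unit" })
        {
            // Each group keeps its own capture list, indexed independently
            // of which repetition of the outer group produced the capture.
            CaptureCollection caps = m.Groups[name].Captures;
            for (int i = 0; i < caps.Count; i++)
            {
                lines.Add($"[{name}][{i}]={caps[i].Value}");
            }
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Dump("3 tbsp 1/2 tsp"))
        {
            Console.WriteLine(line);
        }
    }
}
```

Because every group's Captures list is filled independently, the index of a capture only tells you its order within that one group, not which measurement it belongs to.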

Is there any way to get all of the parts to have predictable useful indexes without having to re-run each group through the regex again?

+1  A: 

Seems like you probably need to loop through the input, matching one measurement at a time. Then you would have predictable access to the parts of that measurement, during the loop iteration for that measurement.

LarsH
A: 

Having had a look at this, here are a couple of suggestions that might help improve the regex:

(?<measurement>           # begin group
  \s*                     # optional beginning space or group separator
  (
    (?<integer>\d+)\.?|   # integer
    (
      (?<numtor>\d+)      # numerator
      /
      (?<dentor>[1-9]\d*) # denominator. 0 not allowed
    )
  )
  \s(?<unit>[a-zA-Z]+)
)+                        # end group. can have multiple
  • The regex expects a space at the start, right after the measurement tag.
  • After (?<integer>\d+), I would try \s? instead of \. to capture the whitespace; \. is an escaped full stop, so the regex would expect a literal full stop to appear there.
  • Escape the / so that it is treated as a literal: \/
  • What is the | separator for? It makes the two parts mutually exclusive: either an 'integer', or a 'numtor' with a 'dentor'. That part looks confusing.
tommieb75
`/` has no special meaning in regexes. Some flavors use it as a delimiter for regex *literals* (JavaScript, for example), but in .NET it's just another character; you don't have to escape it.
Alan Moore
Thank you for taking time to answer, but I didn't need the regex analyzed -- it's just here to show the issue in question.
Dinah
+1  A: 

Is there any way to get all of the parts to have predictable useful indexes without having to re-run each group through the regex again?

Not with Captures. And if you're going to perform multiple matches anyway, I suggest you remove the + and match each component of the measurement separately, like so:

  string s = @"3 tbsp 1/2 tsp";

  Regex r = new Regex(@"\G\s* # anchor to end of previous match
    (?<measurement>           # begin group
      (
        (?<integer>\d+)       # integer
      |
        (
          (?<numtor>\d+)      # numerator
          /
          (?<dentor>[1-9]\d*) # denominator. 0 not allowed
        )
      )
      \s+(?<unit>[a-zA-Z]+)
    )                         # end group.
  ", RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

  foreach (Match m in r.Matches(s))
  {
    for (int i = 1; i < m.Groups.Count; i++)
    {
      Group g = m.Groups[i];
      if (g.Success)
      {
        Console.WriteLine("[{0}] = {1}", r.GroupNameFromNumber(i), g.Value);
      }
    }
    Console.WriteLine("");
  }

output:

[measurement] = 3 tbsp
[integer] = 3
[unit] = tbsp

[measurement] = 1/2 tsp
[numtor] = 1
[dentor] = 2
[unit] = tsp

The \G at the beginning ensures that matches occur only at the point where the previous match ended (or at the beginning of the input if this is the first match attempt). You can also save the match-end position between calls, then use the two-argument Matches method to resume parsing at that same point (as if that were really the beginning of the input).
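
The save-and-resume idea in the last sentence might look roughly like the sketch below. It uses Regex.Match(String, Int32), the single-match sibling of the two-argument Matches method, and the names MeasurementParser, Parse and pos are hypothetical, chosen only for illustration. With a start position supplied, \G is satisfied at that position.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MeasurementParser
{
    // Same idea as the pattern above: \G pins each match to the scan position.
    private static readonly Regex MeasurementRegex = new Regex(@"\G\s*
        (?<measurement>
          (
            (?<integer>\d+)
          |
            (?<numtor>\d+) / (?<dentor>[1-9]\d*)
          )
          \s+(?<unit>[a-zA-Z]+)
        )",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

    // Returns one dictionary per measurement, holding only the parts that matched.
    public static List<Dictionary<string, string>> Parse(string input)
    {
        var result = new List<Dictionary<string, string>>();
        int pos = 0; // saved match-end position

        while (pos < input.Length)
        {
            // Resume at the saved position; \G fails anywhere else.
            Match m = MeasurementRegex.Match(input, pos);
            if (!m.Success)
            {
                break; // no measurement starts here, so stop parsing
            }

            var parts = new Dictionary<string, string>();
            foreach (string name in new[] { "integer", "numtor", "dentor", "unit" })
            {
                Group g = m.Groups[name];
                if (g.Success)
                {
                    parts[name] = g.Value;
                }
            }
            result.Add(parts);

            pos = m.Index + m.Length; // remember where this match ended
        }
        return result;
    }

    public static void Main()
    {
        foreach (var parts in Parse("3 tbsp 1/2 tsp"))
        {
            foreach (var kv in parts)
            {
                Console.WriteLine("[{0}] = {1}", kv.Key, kv.Value);
            }
            Console.WriteLine();
        }
    }
}
```

Keeping the position in a variable lets other parsing steps run between matches, at the cost of managing the loop yourself instead of letting Matches do it.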

Alan Moore