Can anyone explain:

  1. Why do the two patterns used below give different results? (answered below)
  2. Why does the second example give a group count of 1 but report the start and end of group 1 as -1?
 // Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
 public void testGroups() throws Exception
 {
  String TEST_STRING = "After Yes is group 1 End";
  {
   Pattern p;
   Matcher m;
   String pattern="(?:Yes|No)(.*)End";
   p=Pattern.compile(pattern);
   m=p.matcher(TEST_STRING);
   boolean f=m.find();
   int count=m.groupCount();
   int start=m.start(1);
   int end=m.end(1);

   System.out.println("Pattern=" + pattern + "\t Found=" + f + " Group count=" + count + 
     " Start of group 1=" + start + " End of group 1=" + end );
  }

  {
   Pattern p;
   Matcher m;

   String pattern="(?:Yes)|(?:No)(.*)End";
   p=Pattern.compile(pattern);
   m=p.matcher(TEST_STRING);
   boolean f=m.find();
   int count=m.groupCount();
   int start=m.start(1);
   int end=m.end(1);

   System.out.println("Pattern=" + pattern + "\t Found=" + f + " Group count=" + count + 
     " Start of group 1=" + start + " End of group 1=" + end );
  }

 }

Which gives the following output:

Pattern=(?:Yes|No)(.*)End  Found=true Group count=1 Start of group 1=9 End of group 1=21
Pattern=(?:Yes)|(?:No)(.*)End  Found=true Group count=1 Start of group 1=-1 End of group 1=-1
+6  A: 
  1. The difference is that in the second pattern "(?:Yes)|(?:No)(.*)End", the concatenation ("X followed by Y" in "XY") has higher precedence than the choice ("Either X or Y" in "X|Y"), like multiplication has higher precedence than addition, so the pattern is equivalent to

    "(?:Yes)|(?:(?:No)(.*)End)"
    

    What you wanted to get is the following pattern:

    "(?:(?:Yes)|(?:No))(.*)End"
    

    This yields the same output as your first pattern.

    In your test, the second pattern has the group 1 at the (empty) range [-1, -1[ because that group did not match (the start -1 is included, the end -1 is excluded, making the half-open interval empty).

  2. A capturing group is a group that may capture input. If it captures, one also says it matches some substring of the input. If the regex contains choices, then not every capturing group may actually capture input, so there may be groups that do not match even if the regex matches.

  3. The group count, as returned by Matcher.groupCount(), is determined purely by counting the opening parentheses of capturing groups in the pattern, irrespective of whether any of them matches a given input. Your pattern has exactly one capturing group: (.*). This is group 1. The documentation states:

    (?:X)    X, as a non-capturing group
    

    and explains:

    Groups beginning with (? are either pure, non-capturing groups that do not capture text and do not count towards the group total, or named-capturing group.

    Whether any specific group matches a given input is irrelevant for that definition. E.g., in the pattern (Yes)|(No), there are two groups ((Yes) is group 1, (No) is group 2), but only one of them can match for any given input.

  4. The call to Matcher.find() returns true if the regex matched some substring of the input. You can determine which groups matched by looking at their start: if it is -1, the group did not match, and its end is -1 too. There is no built-in method that tells you how many capturing groups actually matched after a call to find() or matches(). You have to count these yourself by checking each group's start.

  5. When it comes to backreferences, also note what the regex tutorial has to say:

    There is a difference between a backreference to a capturing group that matched nothing, and one to a capturing group that did not participate in the match at all.
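
The distinction in point 5 can be checked directly. This is only a minimal sketch, assuming java.util.regex; the class name BackrefDemo, the helper fullMatch, and the sample patterns are made up for illustration and are not taken from the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackrefDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Group 1 participates but captures the empty string, so \1 matches "".
        System.out.println(fullMatch("(q?)b\\1", "b"));   // true
        // Group 1 is skipped entirely; a backreference to it cannot match.
        System.out.println(fullMatch("(q)?b\\1", "b"));   // false

        // The same difference is visible through start(1) and group(1).
        Matcher empty = Pattern.compile("(q?)b").matcher("b");
        empty.matches();
        System.out.println(empty.start(1) + " \"" + empty.group(1) + "\""); // 0 ""

        Matcher absent = Pattern.compile("(q)?b").matcher("b");
        absent.matches();
        System.out.println(absent.start(1) + " " + absent.group(1));        // -1 null
    }
}
```

A group that matched the empty string has a start of 0 or more and group(1) returns "", while a group that did not participate has a start of -1 and group(1) returns null.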

Christian Semrau
Thank you for this answer. I'd still like to understand why the group count is 1. I understood (from the documentation and other experiments) that a group count of 1 should mean that a single numbered group had been found and therefore start(1) should be > -1.
The group count is determined purely by counting the opening parentheses of capturing groups, and your pattern has exactly one: `(.*)`. This is group 1. Whether any specific group matches a given input is irrelevant for that definition. E.g., in the pattern `"(Yes)|(No)"`, there are two groups ("(Yes)" is group 1, "(No)" is group 2), but only one of them can match for any given input.
Christian Semrau
So you are saying that where the documentation says "Returns the number of capturing groups in this matcher's pattern." it means the count in the expression even if there is no match? In that case why does the call to find() return true? Or to put it another way, how is one intended to determine whether any groups matched and if so how many?
+2  A: 

Due to the precedence of the "|" operator in the pattern, the second pattern is equivalent to:

(?:Yes)|(?:(?:No)(.*)End)

What you want is

(?:(?:Yes)|(?:No))(.*)End
jimr
You're wrong about groupCount, as the Javadoc clearly explains: *Group zero denotes the entire pattern by convention. It is **not** included in this count.* It's unintuitive.
Mark Peters
Ack, I was wrong; removing that from the answer.
jimr
+1  A: 

When using regular expressions, it is important to remember that there is an implicit AND operator (concatenation) at work. This can be seen in the Javadoc for java.util.regex.Pattern, which lists the logical operators:

Logical operators
XY X followed by Y
X|Y Either X or Y
(X) X, as a capturing group

This AND takes precedence over the OR in the second Pattern. The second Pattern is equivalent to
(?:Yes)|(?:(?:No)(.*)End).
In order for it to be equivalent to the first Pattern it must be changed to
(?:(?:Yes)|(?:No))(.*)End
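
To see the precedence rule in action, the patterns can be run side by side against the test string from the question. This is a rough sketch; the class name PrecedenceDemo and the helper groupOneSpan are invented for this example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrecedenceDemo {
    static final String INPUT = "After Yes is group 1 End";

    // Hypothetical helper: runs find() once and reports the overall match
    // plus the span of group 1 (-1..-1 means group 1 did not participate).
    static String groupOneSpan(String regex) {
        Matcher m = Pattern.compile(regex).matcher(INPUT);
        if (!m.find()) {
            return "no match";
        }
        return "match=[" + m.group() + "] group1=" + m.start(1) + ".." + m.end(1);
    }

    public static void main(String[] args) {
        String[] regexes = {
            "(?:Yes|No)(.*)End",          // first pattern from the question
            "(?:Yes)|(?:No)(.*)End",      // second pattern from the question
            "(?:Yes)|(?:(?:No)(.*)End)",  // second pattern with precedence made explicit
            "(?:(?:Yes)|(?:No))(.*)End"   // regrouped to behave like the first
        };
        for (String regex : regexes) {
            System.out.println(regex + " -> " + groupOneSpan(regex));
        }
    }
}
```

The first and last patterns report `match=[Yes is group 1 End] group1=9..21`; the two middle ones report `match=[Yes] group1=-1..-1`, because the overall match is satisfied by the `Yes` alternative alone.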

Jacob Tomaw
+2  A: 

To summarise,

1) The two patterns give different results because of the precedence rules of the operators.

  • (?:Yes|No)(.*)End matches (Yes or No) followed by .*End
  • (?:Yes)|(?:No)(.*)End matches (Yes) or (No followed by .*End)

2) The second pattern gives a group count of 1 but a start and end of -1 because of the (not necessarily intuitive) meanings of the results returned by the Matcher method calls.

  • Matcher.find() returns true if a match was found. In your case the match was on the (?:Yes) part of the pattern.
  • Matcher.groupCount() returns the number of capturing groups in the pattern regardless of whether the capturing groups actually participated in the match. In your case only the non-capturing (?:Yes) part of the pattern participated in the match, but the capturing (.*) group was still part of the pattern so the group count is 1.
  • Matcher.start(n) and Matcher.end(n) return the start and end index of the subsequence matched by the nth capturing group. In your case, although an overall match was found, the (.*) capturing group did not participate in the match and so did not capture a subsequence, hence the -1 results.

3) (Question asked in comment.) To determine how many capturing groups actually captured a subsequence, call Matcher.start(n) for each n from 0 to Matcher.groupCount() inclusive and count the results that are not -1. (Note that group 0 denotes the whole match rather than a capturing group in the pattern, and it always has a valid start after a successful find(), so you may want to begin at 1.)
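
The counting loop described in 3) might look roughly like this. The class and method names (ParticipatingGroups, countParticipating, countFor) are hypothetical, and returning -1 when there is no match at all is just a convention chosen for this sketch:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParticipatingGroups {
    // Counts groups 1..groupCount() that took part in the most recent
    // successful match. Group 0 (the whole match) is deliberately skipped.
    static int countParticipating(Matcher m) {
        int participating = 0;
        for (int i = 1; i <= m.groupCount(); i++) {
            if (m.start(i) != -1) {
                participating++;
            }
        }
        return participating;
    }

    // Convenience wrapper for this sketch: -1 means find() found nothing,
    // in which case start(i) must not be called (it would throw).
    static int countFor(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? countParticipating(m) : -1;
    }

    public static void main(String[] args) {
        String input = "After Yes is group 1 End";
        System.out.println(countFor("(?:Yes|No)(.*)End", input));     // 1
        System.out.println(countFor("(?:Yes)|(?:No)(.*)End", input)); // 0
        System.out.println(countFor("(Yes)|(No)", input));            // 1 of 2 groups
    }
}
```

Note that start(n) throws IllegalStateException if no match has been attempted or the last attempt failed, so the result of find() has to be checked first.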