Java Matcher groups: understanding the difference between "(?:X|Y)" and "(?:X)|(?:Y)"

user358795 · Jun 4, 2010

Can anyone explain:

  1. Why do the two patterns used below give different results? (answered below)
  2. Why does the 2nd example give a group count of 1 but report the start and end of group 1 as -1?

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public void testGroups() throws Exception
    {
        String TEST_STRING = "After Yes is group 1 End";
        {
            // First pattern: alternation inside a single non-capturing group.
            String pattern = "(?:Yes|No)(.*)End";
            Pattern p = Pattern.compile(pattern);
            Matcher m = p.matcher(TEST_STRING);
            boolean f = m.find();
            int count = m.groupCount();
            int start = m.start(1);
            int end = m.end(1);

            System.out.println("Pattern=" + pattern + "\t Found=" + f + " Group count=" + count +
                " Start of group 1=" + start + " End of group 1=" + end);
        }

        {
            // Second pattern: each alternative in its own non-capturing group.
            String pattern = "(?:Yes)|(?:No)(.*)End";
            Pattern p = Pattern.compile(pattern);
            Matcher m = p.matcher(TEST_STRING);
            boolean f = m.find();
            int count = m.groupCount();
            int start = m.start(1);
            int end = m.end(1);

            System.out.println("Pattern=" + pattern + "\t Found=" + f + " Group count=" + count +
                " Start of group 1=" + start + " End of group 1=" + end);
        }
    }

Which gives the following output:

Pattern=(?:Yes|No)(.*)End  Found=true Group count=1 Start of group 1=9 End of group 1=21
Pattern=(?:Yes)|(?:No)(.*)End  Found=true Group count=1 Start of group 1=-1 End of group 1=-1

Answer

Christian Semrau · Jun 4, 2010
  1. In the second pattern "(?:Yes)|(?:No)(.*)End", concatenation ("X followed by Y" in "XY") has higher precedence than alternation ("either X or Y" in "X|Y"), just as multiplication has higher precedence than addition, so the pattern is equivalent to

    "(?:Yes)|(?:(?:No)(.*)End)"
    

    What you wanted is the following pattern:

    "(?:(?:Yes)|(?:No))(.*)End"
    

    This yields the same output as your first pattern.

    In your test, the second pattern matches only "Yes" through its first alternative, so group 1 never takes part in the match. Matcher.start(1) and Matcher.end(1) therefore both return -1, which you can read as the empty half-open range [-1, -1[ (the start -1 is included, the end -1 is excluded).

  2. A capturing group is a group that may capture input. If it captures, one also says it matches some substring of the input. If the regex contains alternatives, not every capturing group necessarily captures input, so some groups may not match even when the regex as a whole matches.

  3. The group count, as returned by Matcher.groupCount(), is determined purely by counting the opening parentheses of capturing groups, regardless of whether any of them matches on a given input. Your pattern has exactly one capturing group: (.*). This is group 1. The documentation states:

    (?:X)    X, as a non-capturing group
    

    and explains:

    Groups beginning with (? are either pure, non-capturing groups that do not capture text and do not count towards the group total, or named-capturing group.

    Whether a specific group matches on a given input is irrelevant for that definition. E.g., the pattern (Yes)|(No) has two groups ((Yes) is group 1, (No) is group 2), but only one of them can take part in any single match.

  4. The call to Matcher.find() returns true if the regex matched some substring of the input. You can determine which groups took part by looking at their start: if it is -1, the group did not match, and its end is -1 as well (Matcher.group(int) returns null for such a group). There is no built-in method that tells you how many capturing groups actually matched after a call to find() or matches(). You have to count them yourself by checking each group's start.

  5. When it comes to backreferences, also note what the regex tutorial has to say:

    There is a difference between a backreference to a capturing group that matched nothing, and one to a capturing group that did not participate in the match at all.
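
As a rough check of the points above, the following sketch runs the four patterns from this thread against the question's test string and counts how many capturing groups actually took part in each match by testing start(i) != -1. The class name GroupParticipationDemo and the helpers countParticipatingGroups and describe are made up for this illustration; they are not part of java.util.regex. The last two lines illustrate the backreference remark, assuming Java's behaviour that a backreference to a group which did not participate fails to match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class GroupParticipationDemo {

    // Counts the capturing groups that took part in the most recent
    // successful match. Must only be called after find()/matches() returned true.
    static int countParticipatingGroups(Matcher m) {
        int participating = 0;
        for (int i = 1; i <= m.groupCount(); i++) {
            if (m.start(i) != -1) {   // -1 means "group i did not match"
                participating++;
            }
        }
        return participating;
    }

    // Runs find() once and summarises the result.
    // Assumes the regex has at least one capturing group.
    static String describe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return "match=" + m.group()
            + " groupCount=" + m.groupCount()
            + " participating=" + countParticipatingGroups(m)
            + " group1=[" + m.start(1) + "," + m.end(1) + ")";
    }

    public static void main(String[] args) {
        String input = "After Yes is group 1 End";

        // Alternation at top level: only "Yes" is matched, group 1 is skipped.
        // match=Yes groupCount=1 participating=0 group1=[-1,-1)
        System.out.println(describe("(?:Yes)|(?:No)(.*)End", input));
        System.out.println(describe("(?:Yes)|(?:(?:No)(.*)End)", input));

        // Alternation wrapped in one group: group 1 takes part in the match.
        // match=Yes is group 1 End groupCount=1 participating=1 group1=[9,21)
        System.out.println(describe("(?:(?:Yes)|(?:No))(.*)End", input));
        System.out.println(describe("(?:Yes|No)(.*)End", input));

        // Group 1 matched the empty string, so \1 matches the empty string too.
        System.out.println(Pattern.matches("(a?)b\\1", "b"));  // true
        // Group 1 did not participate at all, so \1 cannot match.
        System.out.println(Pattern.matches("(a)?b\\1", "b"));  // false
    }
}
```

Note that groupCount() is 1 in all four cases; only the participation count and the bounds of group 1 differ.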