tags:

views:

345

answers:

5

I have some programmatically assembled huge regex, like this

(A)|(B)|(C)|...

Each sub-pattern is in its capturing group. When I get a match, how do I figure out which group matches without linearly testing each group(i) to see it returns a non-null string?

+3  A: 

If your regex is programmatically generated, why not programmatically generate n separate regexes and test each of them in turn? Unless they share a common prefix and the Java regex engine is clever, all alternatives get tested anyway.

Update: I just looked through the Sun Java source, in particular, java.util.regex.Pattern$Branch.match(), and that does also simply do a linear search over all alternatives, trying each in turn. The other places where Branch is used do not suggest any kind of optimization of common prefixes.

Thomas
Yes they might share prefixes etc.
Fortepianissimo
See above. Because of the implementation, this does not matter speed-wise...
Thomas
A: 

Break up your regex into three:

String[] regexes = new String[] { "pattern1", "pattern2", "pattern3" };

for(int i = 0; i < regexes.length; i++) {
  Pattern pattern = Pattern.compile(regexes[i]);

  Matcher matcher = pattern.matcher(inputStr);
  if(matcher.matches()) {
     //process, optionally break out of loop
  }
}

public int getMatchedGroupIndex(Matcher matcher) { 
  int index = -1;  

  for(int i = 0; i < matcher.groupCount(); i++) {
    if(matcher.group(i) != null && matcher.group(i).trim().length() > 0) {
      index = i;
    }
  }

  return index;
}

The alternative is:

for(int i = 0; i < matcher.groupCount(); i++) {
  if(matcher.group(i) != null && matcher.group(i).trim().length() > 0) {
     //process, optionally break out of loop
  }
}
OMG Ponies
I don't want to do a linear search. I'm asking if I can get functionality of this non-existent method Matcher.getMatchedGroupIndex() which will magically tell me which group is matched without me wading through each group to test it.
Fortepianissimo
I added the getMatchedGroupIndex() method, but under the covers it will still use a FOR loop to iterate through the group contents.
OMG Ponies
A: 

I don't think you can get around the linear search, but you can make it a lot more efficient by using start(int) instead of group(int).

static int getMatchedGroupIndex(Matcher m)
{ 
  int index = -1;
  for (int i = 1, n = m.groupCount(); i <= n; i++)
  {
    if ( (index = m.start(i)) != -1 )
    {
      break;
    }
  }
  return index;
}

This way, instead of generating a substring for every group, you just query an int value representing its starting index.

Alan Moore
A: 

From the various comments, it seems that the simple answer is "no", and that using separate regexes is a better idea. To improve on that approach, you might need to figure out the common pattern prefixes when you generate them, or use your own regex (or other) pattern matching engine. But before you go to all of that effort, you need to be sure that this is a significant bottleneck in your system. In other words, benchmark it and see if the performance is acceptable for realistic input data, and if not the profile it to see where the real bottlenecks are.

Stephen C
A: 

You can use non-capturing groups, instead of:

(A)|(B)|(C)|...

replace with

((?:A)|(?:B)|(?:C))

The non-capturing groups (?:) will not be included in the group count, but the result of the branch will be captured in the outer () group.

jwiebe