views:

104

answers:

5

I am doing some fairly extensive string manipulations using regular expressions in Java. Currently, I have many blocks of code that look something like:

Matcher m = Pattern.compile("some pattern").matcher(text);
StringBuilder b = new StringBuilder();
int prevMatchIx = 0;
while (m.find()) {
 b.append(text.substring(prevMatchIx, m.start()));
 String matchingText = m.group(); //sometimes group(n)
 //manipulate the matching text
 b.append(matchingText);
 prevMatchIx = m.end();
}
text = b.toString()+text.substring(prevMatchIx);

My question is which of the two alternatives is more efficient (primarily time, but space to some extent):

1) Keep many existing blocks as above (assuming there isn't a better way to handle such blocks -- I can't use a simple replaceAll() because the groups must be operated on).

2) Consolidate the blocks into one big block. Use a "some pattern" that is the combination of all the old blocks' patterns using the |/alternation operator. Then, use if/else if within the loop to handle each of the matching patterns.

Thank you for your help!

A: 

First, does this need to be efficient? If not, don't bother -- complexification won't help code maintainability.

Assuming it does, doing them separately is usually the most efficient. This is especially true if there are large blocks of text in the expressions: without alternation this can be used to speed up matching, with it can't help at all.

If performance is really critical, you can code it several ways and test with sample data.

Charles
+2  A: 

If the order in which the replacements are made matters, you would have to be careful when using technique #1. Allow me to give an example: If I want to format a String so it is suitable for inclusion in XML, I have to first replace all & with &amp; and then make the other replacements (like < to &lt;). Using technique #2, you would not have to worry about this because you are making all the replacements in one pass.

In terms of performance, I think #2 would be quicker because you would be doing less String concatenations. As always, you could implement both techniques and record their speed and memory consumption to find out for certain. :)

Michael Angstadt
Thanks mangst. I have been careful to try to keep the order in mind. And I had the same intuition about String concatenations. I guess I'll just have to test it!
Noah_R-C
The way I see it, the overlap issue militates *against* option #1. Taking the argument a step further, suppose you were replacing every occurrence of `<` with `>`, and every occurrence of `>` with `<`. With the multi-regex/multi-pass approach you would have to replace one of them with some kind of placeholder, replace the other one, then go back and replace the placeholders. But with the single-pass/über-pattern approach, you don't have to worry about things like that.
Alan Moore
Michael Angstadt
+2  A: 

I'd suggest caching the patterns and having a method that uses the cache.

Patterns are expensive to compile so at least you will only compile them once and there is code reuse in using the same method for each instance. Shame about the lack of closures though as that would make things a lot cleaner.

   private static Map<String, Pattern> patterns = new HashMap<String, Pattern>();

   static Pattern findPattern(String patStr) {
      if (! patterns.containsKey(patStr))
         patterns.put(patStr, Pattern.compile(patStr));
      return patterns.get(patStr);
   }

   public interface MatchProcessor {
      public void process(String field);
   }

   public static void processMatches(String text, String pat, MatchProcessor processor) {
      Matcher m = findPattern(pat).matcher(text);

      int startInd = 0;
      while (m.find(startInd)) {
         processor.process(m.group());
         startInd = m.end();
      }
   }
Don Mackenzie
Thanks Don, but I am not reusing the same patterns, unfortunately...
Noah_R-C
If the compile cost is inevitable (due to every pattern being unique) then don't worry about the time impact, concentrate on all the boilerplate code that can be avoided by reuse. However if each regex has to be hand crafted I suspect that a more systematic approach might be taken, perhaps tokenisation? if you are free to present more detail in an edit, I'm sure that you'll get a useful response.
Don Mackenzie
A: 

Option #2 is almost certainly the better way to go, assuming it isn't too difficult to combine the regexes. And you don't have to implement it from scratch, either; the lower-level API that replaceAll() is built on (i.e., appendReplacement() and appendTail()), is also available for your use.

Taking the example that @mangst used, here's how you might process some text to be inserted into an XML document:

import java.util.regex.*;

public class Test
{
  public static void main(String[] args)
  {
    String test_in = "One < two & four > three.";

    Pattern p = Pattern.compile("(&)|(<)|(>)");
    Matcher m = p.matcher(test_in);
    StringBuffer sb = new StringBuffer();  // (1)
    while (m.find())
    {
      String repl = m.start(1) != -1 ? "&amp;" :
                    m.start(2) != -1 ? "&lt;" :
                    m.start(3) != -1 ? "&gt;" : "";

      m.appendReplacement(sb, "");   // (2)
      sb.append(repl);
    }
    m.appendTail(sb);
    System.out.println(sb.toString());
  }
}

In this very simple example, all I need to know about each match is which capture group participated in it, which I find out by means of the start(n) method. But you can use the group() or group(n) method to examine the matched text, as you mentioned in the question.

Note (1) As of JDK 1.6, we have to use a StringBuffer here because StringBuilder didn't exist yet when the Matcher class was written. JDK 1.7 will add support for StringBuilder, plus some other improvements.

Note (2) appendReplacement(StringBuffer, String) processes the String argument to replace any $n sequence with the contents of the n'th capture group. We don't want that to happen, so we pass it an empty string and then append() the replacement string ourselves.

Alan Moore
Could you explain more why you think #2 would be faster? And thanks for the tip on appendTail/appendReplacement -- I'll use those from now on!
Noah_R-C
Option #2 is faster because it does all the work in one pass, while option #1 makes a separate pass for each regex. But performance is a secondary concern; the biggest problem with a multiple-pass approach is that with each pass you run the risk of stepping on the results of previous passes.
Alan Moore
A: 

Last time I was in your position I used a product called jflex.

Java's regex doesn't provide the traditional O(N log M) performance guarantees of true regular expression engines (for input strings of length N, and patterns of length M). Instead it inherits from its perl roots exponential time for some patterns. Unfortunately these pathological patterns, while rare in normal use, are all too common when combining regexes as you propose to do (I can attest to this from personal experience).

Consequently, my advice is to either:

a) pre-compile your patterns as "static final Pattern" constants, so they will be initialized once during [cinit]; or

b) switch to a lexer package such as jflex, which will provide a more declarative, and far more readable, syntax to approach this sort of cascading/sequential regex processing; and

c) seriously consider using a parser generator package. My current favourite is Beaver, but CUP is also a good option. Both of these are excellent tools and I highly recommend both of them, and as they both sit on top of jflex you can add them as/when you need them.

That being said, if you haven't used a parser-generator before and you are in a hurry, it will be easier to get up to speed with JavaCC. Not as powerful as Beaver/CUP but its parsing model is easier to understand.

Whatever you do, please don't use Antlr. It is very fashionable, and has great cheerleaders, but its online documentation sucks, its syntax is awkward, its performance is poor, and its scannerless design makes several common simple cases painful to handle. You would be better off using an abomination like sablecc(v1).

Note: Yes I have used everything I have mentioned above, and more besides; so this advice comes from personal experience.

Recurse
Thanks for the list of tools, and the tip on static final -- I am definitely going to make sure to do that.
Noah_R-C