views:

150

answers:

3

I have a class that is doing a lot of text processing. For each string, which is anywhere from 100->2000 characters long, I am performing 30 different string replacements.

Example:

string modified;
for(int i = 0; i < num_strings; i++){
 modified = runReplacements(strs[i]);
 //do stuff
}

public runReplacements(String str){
  str = str.replace("foo","bar");
  str = str.replace("baz","beef");
  ....
  return str;
}

'foo', 'baz', and all other "targets" are only expected to appear once and are string literals (no need for an actual regex).

As you can imagine, I am concerned about performance :)

Given this,

  • replaceFirst() seems a bad choice because it won't use Pattern.LITERAL and will do extra processing that isn't required.

  • replace() seems a bad choice because it will traverse the entire string looking for multiple instances to be replaced.

Additionally, since my replacement texts are the same everytime, it seems to make sense for me to write my own code otherwise String.replaceFirst() or String.replace() will be doing a Pattern.compile every single time in the background. Thinking that I should write my own code, this is my thought:

  • Perform a Pattern.compile() only once for each literal replacement desired (no need to recompile every single time) (i.e. p1 - p30)

  • Then do the following for each pX: p1.matcher(str).replaceFirst(Matcher.quoteReplacement("desiredReplacement"));

This way I abandon ship on the first replacement (instead of traversing the entire string), and I am using literal vs. regex, and I am not doing a re-compile every single iteration.

So, which is the best for performance?

+3  A: 

So, which is the best for performance?

Measure it! ;-)

ETA: Since a two word answer sounds irretrievably snarky, I'll elaborate slightly. "Measure it and tell us..." since there may be some general rule of thumb about the performance of the various approaches you cite (good ones, all) but I'm not aware of it. And as a couple of the comments on this answer have mentioned, even so, the different approaches have a high likelihood of being swamped by the application environment. So, measure it in vivo and focus on this if it's a real issue. (And let us know how it goes...)

andersoj
Damn, beat me to it. @jonathon, you don't have a performance problem until you know you have a performance problem.
dty
and measure it in the context of your application doing what it is meant to do, it might seem like a lot of work, but it could easily be lost in the noise of db calls for any network traffic
Gareth Davis
+1  A: 

First, run and profile your entire application with a simple match/replace. This may show you that:

  • your application already runs fast enough, or
  • your application is spending most of its time doing something else, so optimizing the match/replace code is not worthwhile.

Assuming that you've determined that match/replace is a bottleneck, write yourself a little benchmarking application that allows you to test the performance and correctness of your candidate algorithms on representative input data. It's also a good idea to include "edge case" input data that is likely to cause problems; e.g. for the substitutions in your example, input data containing the sequence "bazoo" could be an edge case. On the performance side, make sure that you avoid the traps of Java micro-benchmarking; e.g. JVM warmup effects.

Next implement some simple alternatives and try them out. Is one of them good enough? Done!

In addition to your ideas, you could try concatenating the search terms into a single regex (e.g. "(foo|baz)" ), use Matcher.find(int) to find each occurrence, use a HashMap to lookup the replacement strings and a StringBuilder to build the output String from input string substrings and replacements. (OK, this is not entirely trivial, and it depends on Pattern/Matcher handling alternates efficiently ... which I'm not sure is the case. But that's why you should compare the candidates carefully.)

In the (IMO unlikely) event that a simple alternative doesn't cut it, this wikipedia page has some leads which may help you to implement your own efficient match/replacer.

Stephen C
A: 

Isn't if frustrating when you ask a question and get a bunch of advice telling you to do a whole lot of work and figure it out for yourself?!

I say use replaceAll();

(I have no idea if it is, indeed, the most efficient, I just don't want you to feel like you wasted your money on this question and got nothing.)

[edit] PS. After that, you might want to measure it.

[edit 2] PPS. (and tell us what you found)

Dr.Dredel