An extension to my previous question:
Text cleaning and replacement: delete \n from a text in Java

I am cleaning incoming text that comes from a database and is irregular. That means there's no standard or set of rules. Some records contain HTML entities like &reg, &trade, &lt, and others come in this form: &#8221, &#8211, etc. Other times I just get the HTML tags with < and >.

I am using String.replace() to replace the entities with the characters they stand for (this should be fine since I'm using UTF-8, right?), and replaceAll() to remove the HTML tags with a regular expression.

Other than one call to the replace() function for each replacement, and compiling the HTML tags regular expression, is there any recommendation to make this replacement efficient?
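For reference, the approach described above might look roughly like the sketch below. The class name NaiveCleaner, the tag pattern, and the list of entities are assumptions made for illustration (the real data may use other entities, or omit the trailing semicolon), so treat it as a baseline to measure rather than a finished cleaner.

```java
import java.util.regex.Pattern;

// Hypothetical baseline: one precompiled pattern for tags,
// plus one String.replace() call per entity.
class NaiveCleaner {
    // Compiled once and reused. This simple pattern is an assumption
    // and will not cope with every kind of malformed markup.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String clean(String text) {
        // Strip tags first, so that a '<' produced by decoding "&lt;"
        // is not mistaken for the start of a tag.
        String noTags = TAG.matcher(text).replaceAll("");
        return noTags
                .replace("&reg;", "\u00AE")    // registered sign
                .replace("&trade;", "\u2122")  // trade mark sign
                .replace("&#8221;", "\u201D")  // right double quotation mark
                .replace("&#8211;", "\u2013")  // en dash
                .replace("&lt;", "<");
    }

    public static void main(String[] args) {
        System.out.println(clean("<b>Acme&reg;</b> 1&#8211;2"));
    }
}
```

Each replace() call scans the whole string once, so the cost grows with the number of entities handled.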

+3  A: 

You're going to run into performance bottlenecks with replace and replaceAll.

If you want to increase performance:

  1. Don't use replace. Strings are immutable, so every call to replace that changes anything creates a new copy of the string.
  2. Don't use regular expressions (replaceAll), though if the pattern is precompiled it's not as bad.
  3. Parse the text and do the replacements yourself using a StringBuilder.

Some code on your end might help discussion.

String str = ...
StringBuilder sb = new StringBuilder(str.length());
for (int i = 0; i < str.length(); i++) {
  char toAppend = str.charAt(i);  // by default, keep the character as-is
  switch (toAppend) {
    case '&': toAppend = '&'; break;  // placeholder: assign the replacement character here
    case ...
      ...
  }
  sb.append(toAppend);
}
String result = sb.toString();

This is from: Sun Forums
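The loop above swaps one character for another, but entities such as &reg; or &#8211; span several characters. A minimal sketch of the same single-pass StringBuilder idea extended to entities could look like this. The class name EntityDecoder, the lookup table, and the length limit are all hypothetical, and the sketch assumes entities end with a semicolon and that numeric references are decimal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-pass decoder: walks the input once and appends
// either the decoded entity or the original character.
class EntityDecoder {
    // Assumed upper bound on the text between '&' and ';'.
    private static final int MAX_NAME_LENGTH = 10;

    // Illustrative table only; extend it with the entities the data contains.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("reg", "\u00AE");
        ENTITIES.put("trade", "\u2122");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("amp", "&");
    }

    static String decode(String str) {
        StringBuilder sb = new StringBuilder(str.length());
        int i = 0;
        while (i < str.length()) {
            char c = str.charAt(i);
            int end = (c == '&') ? findSemicolon(str, i + 1) : -1;
            String replacement = (end > 0) ? lookup(str.substring(i + 1, end)) : null;
            if (replacement != null) {
                sb.append(replacement);  // known entity: emit its character
                i = end + 1;             // and skip past the ';'
            } else {
                sb.append(c);            // anything else is copied unchanged
                i++;
            }
        }
        return sb.toString();
    }

    // Looks for ';' within a short window so a stray '&' stays cheap.
    private static int findSemicolon(String s, int from) {
        int limit = Math.min(s.length(), from + MAX_NAME_LENGTH + 1);
        for (int j = from; j < limit; j++) {
            if (s.charAt(j) == ';') return j;
        }
        return -1;
    }

    // Resolves a name such as "reg" or a decimal reference such as "#8211".
    private static String lookup(String name) {
        if (name.startsWith("#")) {
            try {
                int cp = Integer.parseInt(name.substring(1));
                return Character.isValidCodePoint(cp)
                        ? new String(Character.toChars(cp)) : null;
            } catch (NumberFormatException e) {
                return null;  // not a number: leave the text untouched
            }
        }
        return ENTITIES.get(name);
    }

    public static void main(String[] args) {
        System.out.println(decode("Acme&reg; &#8211; AT&T &lt;b&gt;"));
    }
}
```

Unknown entities and bare ampersands are left as they are, so the input is never corrupted by a missing table entry.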

Nathan
+7  A: 

My first suggestion is to measure the performance of the simplest way of doing it (which is probably multiple replace/replaceAll calls). Yes, it's potentially inefficient. Quite often the simplest way of doing this is inefficient. You need to ask yourself: how much do you care?

Do you have sample data and a threshold at which point the performance is acceptable? If you don't, that's the first port of call. Then test the naive implementation, and see whether it really is a problem. (Bear in mind that string replacement is almost certainly only part of what you're doing. As you're fetching the text from a database to start with, that may well end up being the bottleneck.)

Once you've determined that the replacement really is the bottleneck, it's worth performing some tests to see which bits of the replacement are causing the biggest problem - it sounds like you're doing several different kinds of replacement. The more you can narrow it down, the better: you may find that the real bottleneck in the simplest code is caused by something which is easy to make efficient in a reasonably simple way, whereas trying to optimise everything would be a lot harder.
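As a rough illustration of the "measure first" advice, a crude timing loop like the one below can compare candidate implementations on sample data. The class and method names are hypothetical, and hand-rolled timings like this are easily skewed by JIT compilation and garbage collection, so a proper harness such as JMH would give more trustworthy numbers.

```java
import java.util.function.UnaryOperator;

// Hypothetical helper for a quick, rough comparison of cleaning functions.
class ReplaceTiming {
    // Returns the average time per call in nanoseconds.
    static long averageNanos(UnaryOperator<String> cleaner, String sample, int runs) {
        // Warm-up pass so the JIT has a chance to compile the code first.
        for (int i = 0; i < runs; i++) {
            cleaner.apply(sample);
        }
        long sink = 0;  // accumulate results so the calls are not optimised away
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            sink += cleaner.apply(sample).length();
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) {
            System.out.println(sink);  // never true; only keeps 'sink' in use
        }
        return elapsed / runs;
    }

    public static void main(String[] args) {
        String sample = "<b>Acme&reg;</b> a &lt; b";  // made-up sample text
        long nanos = averageNanos(
                s -> s.replaceAll("<[^>]*>", "").replace("&lt;", "<"),
                sample, 100_000);
        System.out.println("average ns per call: " + nanos);
    }
}
```

Running the same loop against each replacement step separately helps show which one, if any, dominates.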

Jon Skeet
+1 because you're right even though adding to the 35k seems just wrong. Where will it end...?
Steve B.
@SteveB: You'll be glad to hear that upvoting me at this time of day makes no odds to my rep.
Jon Skeet
+2  A: 

String replacement in Java is very slow; I think you should use a faster language.

Here is a code sample showing how to replace a string in assembly:

http://szabgab.com/talks/fundamentals_of_perl/replace-string-in-assembly-code.html
