I need to escape every & in a String that isn't part of an HTML entity, so that the String "This & entites &gt; &amp; &lt;" becomes "This &amp; entites &gt; &amp; &lt;".

I've come up with this regex pattern: "&[a-zA-Z0-9]{2,7};", which works fine. But I'm not very skilled with regex, and when I test the speed over 100k iterations, it takes twice as long as the method I used previously, which didn't use regex (but didn't work 100% either).

Testcode:

long time = System.currentTimeMillis();
String reg = "&(?!&#?[a-zA-Z0-9]{2,7};)";
String s = "a regex test 1 & 2  1&2 and &_gt; - &_lt;";
String test = null;
for (int i = 0; i < 100000; i++) {
    test = s.replaceAll(reg, "&amp;");
}
System.out.println("Finished in:" + (System.currentTimeMillis() - time) + " milliseconds");

So the question is: are there any obvious ways to optimize this regex so that it runs faster?

A: 

I'm not very familiar with the Java regex classes, but in general you may want to investigate a zero-width lookahead for the ; after the ampersand.

Here is a link describing positive and negative lookaheads

John Weldon
That is in fact the page I was looking at when I wrote this :). I looked at positive vs. negative lookaheads, but the changes I tried didn't improve efficiency.
Duveit
A: 

Another way of doing this without blowing your head with regexes would be to use StringEscapeUtils from Commons Lang.

Valentin Rocher
+4  A: 

s.replaceAll(reg, "&amp;") compiles the regular expression on every call. Compiling the pattern once gives some increase in performance (~30% in this case).

long time = System.currentTimeMillis();
String reg = "&(?!&#?[a-zA-Z0-9]{2,7};)";
Pattern p = Pattern.compile(reg);
String s="a regex test 1 & 2  1&2 and &_gt; - &_lt;";
for (int i = 0; i < 100000; i++) {
    String test = p.matcher(s).replaceAll("&amp;");
}
System.out.println("Finished in:" + 
             (System.currentTimeMillis() - time) + " milliseconds");
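
Outside a benchmark loop, the same idea is often expressed by keeping the compiled Pattern in a static final field. The following is only a minimal sketch: the class name AmpersandEscaper and the method escape are hypothetical, and the pattern drops the extra & inside the lookahead, which is an assumption about the intended behaviour rather than the exact pattern from the question.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
class AmpersandEscaper {
    // Compiled once when the class is loaded. Pattern instances are
    // immutable and safe to share between threads.
    // Assumption: the lookahead has no leading '&', unlike the question's pattern.
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!#?[a-zA-Z0-9]{2,7};)");

    // Replaces every '&' that does not start something shaped like an entity.
    static String escape(String input) {
        return BARE_AMPERSAND.matcher(input).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escape("1 & 2, 1&2 and &gt; - &lt;"));
    }
}
```

A new Matcher is still created per call, but that is cheap compared with recompiling the pattern each time.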
Chris Thornhill
That is true, it got it down from 550ms to 450ms. I'll see if we can implement the precompiled pattern.
Duveit
+1  A: 

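You have to remove the & from inside your look-ahead assertion: the look-ahead starts right after the & that has already been matched, so it should only describe the rest of the entity. So try this regular expression: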

&(?!#?[a-zA-Z0-9]{2,7};)

Or to be more precise:

&(?!(?:#(?:[xX][0-9a-fA-F]+|[0-9]+)|[a-zA-Z]+);)
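
To see how such a stricter lookahead behaves, here is a small sketch in Java. It uses a variant of the pattern above with two tweaks that are assumptions on my part: the hex branch accepts one or more hex digits, and entity names may contain digits after the first letter (as in &frac12;). The class name StrictAmpersandEscaper and the method escape are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper class used only for illustration.
class StrictAmpersandEscaper {
    // '&' not followed by:
    //   #x or #X plus hex digits, or # plus decimal digits (numeric references),
    //   or a letter followed by letters/digits (named references),
    // and then a ';'.
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
            "&(?!(?:#(?:[xX][0-9a-fA-F]+|[0-9]+)|[a-zA-Z][a-zA-Z0-9]*);)");

    static String escape(String input) {
        return BARE_AMPERSAND.matcher(input).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escape("a & b"));      // bare ampersand gets escaped
        System.out.println(escape("&#x20AC;"));   // hex reference is left alone
        System.out.println(escape("&frac12;"));   // named reference is left alone
        System.out.println(escape("&#;"));        // malformed reference gets escaped
    }
}
```

Note that this only checks the shape of a reference, not whether the name is a real HTML entity, so something like &foo; is still left untouched.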
Gumbo