views:

255

answers:

4

I've seen people here made comments like "regex is too slow!", or "why would you do something so simple using regex!" (and then present a 10+ lines alternative instead), etc.

I haven't really used regex in industrial setting, so I'm curious if there are applications where regex is demonstratably just too slow, AND where a simple non-regex alternative exists that performs significantly (maybe even asymptotically!) better.

Obviously many highly-specialized string manipulations with sophisticated string algorithms will outperform regex easily, but I'm talking about cases where a simple solution exists and significantly outperforms regex.

What counts as simple is subjective, of course, but I think a reasonable standard is that if it uses only String, StringBuilder, etc, then it's probably simple.


Note: I would very much appreciate answers that demonstrate the following:

  1. a beginner-level regex solution to a non-toy real-life problem that performs horribly
  2. the simple non-regex solution
  3. the expert-level regex rewrite that performs comparably
+4  A: 

Is regex too slow?

Regex is not intrinsically slow. basic pattern matching is O(n), hard to improve on, certainly for non-trivial patterns.

Henk Holterman
I would caveat this, it's not slow **if used correctly**...just like anything else. Unfortunately though, Regex is used incorrectly more often than other things it seems...but that's the programmer's fault, not Regex itself.
Nick Craver
...there can be a HUGE difference between two O(n) implementations...
danbystrom
Not all regex implementations are O(n). http://swtch.com/~rsc/regexp/regexp1.html
JUST MY correct OPINION
@Nick Craver, I have to disagree when you say that it's only the programmer's fault. Technically you are of course right, but when it's common that something is used incorrectly, I would usually say the design is faulty. Having said that, what Regex does and can do is complex, so IMO is understandable and even acceptable that it is "faulty".
fish
@nick, @danby, all correct but I have seen people stray away from regex because they imagine some horrible backtracking interpreter. Not everyone knows about DFAs
Henk Holterman
@ttmric, interesting link but I'm quite sure that Java uses the Perl approach.
Henk Holterman
@fish - I guess I can further clarify that it's *easy to use incorrectly*...in my opinion, easier than a lot of other technologies. If I could find it on google (how do you search and find this heh?), I would show you were hotmail was searching for `xxx` in a greedy fashion and sending a few thousands `x` in a row would easily bring down a server...all because 2 characters in a pattern string were off. I guess anything complicated is going to be more error-prone (**user** error-prone that is), Regex is just an example of that.
Nick Craver
@Nick: I'd love to see examples of using regex incorrectly (as in having _performance_ problems but functionally correct), and perhaps what the correct (i.e. faster) way is.
polygenelubricants
@polygenelubricants: I added just such an example (although not taken from production code).
Christian Semrau
@Nick - exactly :)
fish
+1  A: 

Well, not always but sometimes slow, depends on patterns and implementations.

A quick example, 2x slower than normal replace, but I don't think its that slow.

>>> import time,re
>>>
>>> x="abbbcdexfbeczexczczkef111anncdehbzzdezf" * 500000
>>>
>>> start=time.time()
>>> y=x.replace("bc","TEST")
>>> print time.time()-start,"s"
0.350999832153 s
>>>
>>> start=time.time()
>>> y=re.sub("bc","TEST",x)
>>> print time.time()-start,"s"
0.751000165939 s
>>>
S.Mark
+1, A "Real life example". But only because of the simplicity of "bc". Change the requirement to: replace every sequence of 1 or more 'b' characters and you can no longer use a single lib method.
Henk Holterman
@Henk Holterman: Yes, but your new example is what regular expressions are there for, while simple, _static_ replacement is not.
Svante
+5  A: 

I remember a textbook example. The mistake made in that example is quite common: Using a dot where a narrower character class is better suited.

In a CSV file containing on each line exactly 12 integers separated by commas, find the lines that have a 13 in the 6th position (no matter where else a 13 may be).

1, 2, 3, 4, 5, 6, 7, 8 ,9 ,10,11,12 -- don't match
42,12,13,12,32,13,14,43,56,31,78,10 -- match
42,12,13,12,32,14,13,43,56,31,78,10 -- don't match

We use a regex containing exactly 11 commas:

".*,.*,.*,.*,.*,13,.*,.*,.*,.*,.*,.*"

This way, each ".*" is confined to a single number. This regex solves the task, but has very bad performance. (Roughly 600 microseconds per string on my computer, with little difference between matched and unmatched strings.)

A simple non-regex solution would be to split() each line and compare the 6th element. (Much faster: 9 microseconds per string.)

The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ".*" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.

So we replace the greedy quantifier with the reluctant one:

".*?,.*?,.*?,.*?,.*?,13,.*?,.*?,.*?,.*?,.*?,.*?"

This performs way better for a matched string (by a factor of 100), but has almost unchanged performance for a non-matched string.

A performant regex replaces the dot by the character class "[^,]":

"[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,13,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*"

(This needs 3.7 microseconds per string for the matched string and 2.4 for the unmatched strings on my computer.)

Christian Semrau
So the regex, in this case, is faster than the simple alternative of using `split()`
donut
Yes it is, mainly because `split()` uses a regex internally. Even faster than the performant regex (but less readable) is a StringTokenizer: `StringTokenizer st = new StringTokenizer(input, ","); for (int i = 0; i < 5; i++) { st.nextToken(); } boolean match = "13".equals(st.nextToken());`
Christian Semrau
+2  A: 

I experimented a bit with the performance of various constructs, and unfortunately I discovered that Java regex doesn't perform what I consider very doable optimizations.

Java regex takes O(N) to match "(?s)^.*+$"

This is very disappointing. It's understandable for ".*" to take O(N), but with the optimization "hints" in the form of anchors (^ and $) and single-line mode Pattern.DOTALL/(?s), even making the repetition possessive (i.e. no backtracking), the regex engine still could not see that this will match every string, and still have to match in O(N).

This pattern isn't very useful, of course, but consider the next problem.

Java regex takes O(N) to match "(?s)^A.*Z$"

Again, I was hoping that the regex engine can see that thanks to the anchors and single-line mode, this is essentially the same as the O(1) non-regex:

 s.startsWith("A") && s.endsWith("Z")

Unfortunately, no, this is still O(N). Very disappointing. Still, not very convincing because a nice and simple non-regex alternative exists.

Java regex takes O(N) to match "(?s)^.*[aeiou]{3}$"

This pattern matches strings that ends with 3 lowercase vowels. There is no nice and simple non-regex alternative, but you can still write something non-regex that matches this in O(1), since you only need to check the last 3 characters (for simplicity, we can assume that the string length is at least 3).

I also tried "(?s)^.*$(?<=[aeiou]{3})", in an attempt to tell the regex engine to just ignore everything else, and just check the last 3 characters, but of course this is still O(N) (which follows from the first section above).

In this particular scenario, however, regex can be made useful by combining it with substring. That is, instead of seeing if the whole string matches the pattern, you can manually restrict the pattern to attempt to match only the last 3 characters substring. In general, if you know before hand that the pattern has a finite length maximum match, you can substring the necessary amount of characters from the end of a very long string and regex on just that part.


Test harness

static void testAnchors() {
    String pattern = "(?s)^.*[aeiou]{3}$";
    for (int N = 1; N < 20; N++) {
        String needle = stringLength(1 << N) + "ooo";
        System.out.println(N);
        boolean b = true;
        for (int REPS = 10000; REPS --> 0; ) {
            b &= 
              needle
              //.substring(needle.length() - 3) // try with this
              .matches(pattern);
        }
        System.out.println(b);
    }
}

The string length in this test grows exponentially. If you run this test, you will find that it starts to really slow down after 10 (i.e. string length 1024). If you uncomment the substring line, however, the entire test will complete in no time (which also confirms that the problem is not because I didn't use Pattern.compile, which would yield a constant improvement at best, but rather because the patttern takes O(N) to match, which is problematic when the asymptotic growth of N is exponential).


Conclusion

It seems that Java regex does little to no optimization based on the pattern. Suffix matching in particular is especially costly, because the regex still needs to go through the entire length of the string.

Thankfully, doing the regex on the chopped suffix using substring (if you know the maximum length of the match) can still allow you to use regex for suffix matching in time independent of the length of the input string.

//update: actually I just realized that this applies to prefix matching too. Java regex matches a O(1) length prefix pattern in O(N). That is, "(?s)^[aeiou]{3}.*$" checks if a string starts with 3 lowercase letters in O(N) when it should be optimizable to O(1).

I thought prefix matching would be more regex-friendly, but I don't think it's possible to come up with a O(1)-runtime pattern to match the above (unless someone can prove me wrong).

Obviously you can do the s.substring(0, 3).matches("(?s)^[aeiou]{3}.*$") "trick", but the pattern itself is still O(N); you've just manually reduced N to a constant by using substring.

So for any kind of finite-length prefix/suffix matching of a really long string, you should preprocess using substring before using regex; otherwise it's O(N) where O(1) suffices.

polygenelubricants
How would regex optimize based on the pattern? Fire another regex on it? ;) The `java.util.Pattern` source is by the way quite interesting. It uses a parser/interpreter to execute regex.
BalusC
Unless I'm mistaken, `"(?s)^.*$(?<=[aeiou]{3})"` should be optimizable to `O(1)`. The way I understand it, in single-line `(?s)/Pattern.DOTALL` mode, `^.*$` is an instant `O(1)` match to everything. The lookbehind from the `$` anchor is "obviously" a simple suffix matching attempt. I think it's very possible that some sophisticated regex implementation can optimize this to `O(1)`, no?
polygenelubricants
As a matter of fact, there's an RFE from 2007 requesting that `matches()` or `find()` skip regex matching entirely and simply return `true` in the case of `.*`. The submitter hadn't thought it out as far as you have, but I still don't think it's worth the effort. There can be many reasons to reject regexes as a solution, depending on the nature of the project, the tool set, and the capabilities of the programmers; performance is almost never the deciding factor. ref: http://bugs.sun.com/view_bug.do?bug_id=6565414
Alan Moore
@Alan: what I learned from this exploration is that `.*` itself isn't as trivial as it looks: `"Hello!\nWorld!".matches(".*")` is `false`! It's only `true` in single-line mode `(?s)`.
polygenelubricants
The problem is always the .* One would not use it that way , but instead the s.matcher("^[aeiou]{3}").find() // or was it the other way round?With .* you want to collect n characters into group 0, so it must be O(N)
Ingo