I noticed that the following Regex is very slow to finish on an XML file with only 3,000 lines [1]:

\(<Annotation\(\s*\w\+="[^"]\{-}"\s\{-}\)*>\)\@<=\(\(<\/Annotation\)\@!\_.\)\{-}"MATCH\_.\{-}\(<\/Annotation>\)\@=

I always thought that Regexes are efficient. Why does this Regex take so long to finish?

[1] http://stackoverflow.com/questions/736376/how-can-i-repeatedly-match-from-a-until-b-in-vim

+3  A: 

REs are efficient, but they're not magic :-)

It is not the number of lines in an input file which slows you down (3,000 is a very small number). When an RE is compiled, it basically has to (conceptually) write a program which can parse the input file based on that RE.

The speed of that "program" always depends on the complexity of the RE.

For example, "^$" will be very fast, "[0-9]{1,9}" will be somewhat slower and anything that involves backtracking (i.e., having to back up in the RE, usually anything involving multiple variable-number-of-elements clauses, of which yours is an example) will be slower still.
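A rough way to see those differences is to time each kind of pattern against input that makes it fail; a small Python sketch (the patterns and input size are my own choices, purely for illustration):

import re, timeit

text = "a" + "b" * 1000      # no trailing 'c', so the last pattern must fail

# anchors only / bounded repetition / two variable-length clauses that
# force the engine to backtrack heavily before it can declare failure
for p in (r"^$", r"[0-9]{1,9}", r"a.*b.*c"):
    rx = re.compile(p)
    t = timeit.timeit(lambda: rx.search(text), number=10)
    print(f"{p:12} {t:.4f}s")   # the last pattern is noticeably slower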

Anything you can do to minimize the number of lines beforehand will help to some extent, but as to optimizing the RE itself, that's often considered a black art. One possibility is to first strip out every line except those between the lines where an Annotation starts and stops.

I don't tend to worry too much about optimizing my REs (but they're not usually this complex). My view is that they will take as long as they take. If that's too long, I usually look for another solution which is faster but not so adaptable.

In the case of your RE, where you wanted to get all Annotation XML where the about attribute contains MATCH, I would do it in Perl (or awk for us old-timers :-) since the input file is in a reasonably fixed format:

  • "<Annotation " on first line [a].
  • "MATCH" also on first line [a].
  • </Annotation> on last line and on its own [b].

This would be fast as a simple line scanner, turning on echo when the [a] conditions were met (and printing that line), printing any other line when echo was on, and turning echo off when [b] conditions were met (after printing line).
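Here is a minimal sketch of that scanner in Python (reading from stdin; the exact tests for the [a] and [b] conditions are assumptions based on the list above):

import sys

echo = False
for line in sys.stdin:
    if not echo and "<Annotation " in line and "MATCH" in line:
        echo = True                       # [a] met: start echoing
        print(line, end="")
    elif echo:
        print(line, end="")
        if line.strip() == "</Annotation>":
            echo = False                  # [b] met: stop after printing it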

Yes, far less adaptable but almost certainly faster (given your well-formatted input).

paxdiablo
+1 for comparing regexes. I found some comparisons between different functions on the final page of "The Book of Numbers" by John Horton Conway.
Masi
+5  A: 

Whether a regular expression is efficient or not depends on the expression itself. What makes it inefficient is backtracking. And to avoid that, the regular expression has to be as distinct as possible.

Let’s take the regular expression a.*b as an example and apply it to the string abcdef. The algorithm will first match the literal a in a.*b to the a in abcdef. Next the expression .* will be processed. In the normal greedy mode, where multipliers are expanded to the maximum, it will match the whole rest of the string, bcdef. Then the last literal b in a.*b will be processed. But the end of the string has already been reached, and since a multiplier is in use, the algorithm will try backtracking to match the whole pattern. The match of .* (bcdef) will be decreased by one character (bcde) and the algorithm will try to match the rest of the pattern. But the b in a.*b doesn’t match the f in abcdef. So .* will be decreased by one more character at a time until it matches the empty string (thus . is repeated zero times) and the b in a.*b matches the b in abcdef.

As you can see, a.*b applied to abcdef needs five backtracking steps for .* (six attempts at placing the final b) until the whole regular expression matches. But if we alter the regular expression and make it distinct by using a[^b]*b instead, no backtracking is necessary and the regular expression matches on the first attempt.
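To make that trace concrete, here is a toy backtracking matcher in Python (written just for this illustration, not how any real engine is implemented) that supports literal characters, ., a [^x] class, and a greedy * suffix, and counts how often it has to back up:

def tokenize(pattern):
    """Split a pattern into (atom, starred) pairs; an atom is a literal
    character, '.', or a negated class of the exact form '[^x]'."""
    tokens, i = [], 0
    while i < len(pattern):
        atom = pattern[i:i + 4] if pattern[i] == "[" else pattern[i]
        i += len(atom)
        starred = i < len(pattern) and pattern[i] == "*"
        i += starred
        tokens.append((atom, starred))
    return tokens

def matches(atom, ch):
    if atom == ".":
        return True
    if atom.startswith("[^"):
        return ch != atom[2]
    return ch == atom

def match(pattern, text):
    backtracks = 0

    def m(tokens, pos):
        nonlocal backtracks
        if not tokens:
            return True
        (atom, starred), rest = tokens[0], tokens[1:]
        if not starred:
            return (pos < len(text) and matches(atom, text[pos])
                    and m(rest, pos + 1))
        end = pos                       # greedy: consume as much as possible
        while end < len(text) and matches(atom, text[end]):
            end += 1
        for cut in range(end, pos - 1, -1):   # then give back one at a time
            backtracks += cut != end
            if m(rest, cut):
                return True
        return False

    return m(tokenize(pattern), 0), backtracks

for pat in ("a.*b", "a[^b]*b"):
    ok, steps = match(pat, "abcdef")
    print(pat, "matched:", ok, "backtracking steps:", steps)

It reports five backtracking steps for a.*b on abcdef and zero for a[^b]*b.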

And if you are now considering using lazy modifiers instead, I have to tell you that these rules apply to every modifier, both greedy and lazy. The difference is that instead of first expanding the match to the maximum and then backtracking by decreasing the match one character at a time (greedy), a lazy modifier will first expand to the minimal match and then increase it one character at a time.
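The direction of expansion is visible in the match you get back; a quick sketch with Python's re module:

import re

s = "abcb"
print(re.search(r"a.*b", s).group())    # 'abcb': expanded to the maximum,
                                        #   then backed off until b matched
print(re.search(r"a.*?b", s).group())   # 'ab': expanded from the minimum,
                                        #   one character at a time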

Gumbo
+1  A: 

"Regexes are efficient" is too wide a characterization to be accurate. If you ignore the differences in relative peformance offered by different Regex engines, how efficient your Regex is still depends upon the complexity of the pattern.

A Regex is efficient when it fails fast, not merely when it matches a portion of text fast. IOW, it needs to be specific enough that text which isn't going to match is rejected early in the matching process rather than near the end. This also means that backtracking needs to be minimized. Note that backtracking can reach catastrophic levels, which can result in even the best Regex engine freezing up. In the example provided by Jan Goyvaerts, the size of the file isn't even mentioned, and it is not relevant; it is the size of the string which matters.
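Catastrophic backtracking is easy to reproduce. A sketch using the kind of pattern discussed in that context (the sizes here are my own choices, picked so the demo finishes in reasonable time):

import re, time

# (x+x+)+y can never match a string of x's alone, but a backtracking
# engine tries exponentially many ways to split the x's before failing.
rx = re.compile(r"(x+x+)+y")

for n in (16, 18, 20, 22):
    t0 = time.perf_counter()
    rx.search("x" * n)                             # always fails
    print(n, f"{time.perf_counter() - t0:.3f}s")   # grows exponentially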

Here's a discussion thread I started on RegexAdvice.com a few years ago that attempted to get people discussing the performance of regular expressions. It does have a few pointers to learn from, I guess.

Cerebrus
+1  A: 

I'm finding it very hard to read your regex with all the backslashes; if I'm interpreting this dialect of regex (vim?) correctly, what you're saying is what verbose pcre would spell:

(?<= <Annotation(\s*\w+="[^"]*?"\s*?)*> )
    ((?! <\/Annotation).)*?
    "MATCH.*?
(?= <\/Annotation> )

An obvious problem that jumps out is that it starts with a variable-length lookbehind assertion. Variable-length lookbehind is troublesome enough as it is (and not supported in many regex dialects), but then it's compounded by being followed by a liberal match (any character with negative lookahead).

This is hard for a processor to match; the combination pretty much kills any possible optimisation it might have been able to achieve. For every character in the file it has to backtrack (potentially to the beginning of file/end of previous match) to see if the lookbehind matches, then step forward (potentially to the end of file) looking for "MATCH.
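Many engines refuse variable-length lookbehind outright rather than attempt it. Python's re module, for instance, only accepts fixed-width lookbehind:

import re

try:
    re.compile(r'(?<=<Annotation(\s*\w+="[^"]*?"\s*?)*>)')
except re.error as e:
    print(e)    # "look-behind requires fixed-width pattern"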

You could help it by using only a fixed-size lookbehind, or simply forgetting about the lookarounds and including the whole string in the match, which will be more efficient to find, at the expense of needing a bit more post-processing to pick out the groups you're interested in.

<Annotation(\s+(\w+="[^"]*?"))*\s*>
    (.*? "MATCH .*?)
</Annotation>
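For illustration, here is a sketch of that whole-match approach using Python's re (keeping the tempered (?!</Annotation>) guard from the translation above, so an Annotation without "MATCH can't bleed into the next one; the sample XML is made up):

import re

annotation = re.compile(r"""
    <Annotation (\s+\w+="[^"]*?")* \s*>          # opening tag and attributes
    ( (?: (?!</Annotation>) . )*? "MATCH .*? )   # body containing "MATCH
    </Annotation>
""", re.VERBOSE | re.DOTALL)

xml = '''<Doc>
<Annotation about="a1"><Label>"MATCH this one</Label></Annotation>
<Annotation about="a2"><Label>skip this one</Label></Annotation>
</Doc>'''

for m in annotation.finditer(xml):
    print(m.group(2))   # prints only the first annotation's body

As noted, you then get the extra post-processing step of pulling the pieces you want out of the groups.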

However, as always with this kind of question, Clippy sez:

It looks like you're parsing XML with regex.

Would you like help?

( ) Use a real XML parser because regex is inherently incapable of parsing XML

( ) Just soldier on making unreadable complex regexes which still don't work

[ ] Don't show me this tip again

bobince