My software allows users to use regular expressions to prepare files. I am in the process of adding a default regex library with common expressions that can be re-used to prepare a variety of formats. One common task is to remove CRLFs in specific parts of a file, but not in others. For instance, this:

    <TU>Lorem 
    Ipsum</TU>
    <SOURCE>This is a sentence
    that should not contain
    any line break.
    </SOURCE>

Should become:

    <TU>Lorem 
    Ipsum</TU>
    <SOURCE>This is a sentence that should not contain any line break.
    </SOURCE>

I have a regex that does the job pretty nicely:

(?(?<=<SOURCE>(?:(?!</?SOURCE>).)*)(\r\n))

The problem is that it is processing-intensive: with files above 500 KB, it can take 30+ seconds. (The regex is compiled in this case; uncompiled, it is much slower.)

It's not a big issue, but I wonder if there is a better way to achieve the same results with regex.

Thanks in advance for your suggestions.

+2  A: 

"Compiling" a regular expression simply means converting it from its textual form into a Non-deterministic Finite Automaton. It isn't "compiling into machine code" as you might expect.

NFAs are typically smaller than their corresponding DFAs and can execute more space-efficiently. Every DFA has at least one equivalent NFA and vice versa. However, Perl-Compatible Regular Expressions aren't actually regular, so I don't know whether they are converted into NFAs, or whether "compiling" is just another form of lexical analysis that, once done, need not be done again.

According to Russ Cox, PCREs are slow partly because of their non-regularity, and your expression above is quite non-regular.

Oh, and if you are trying to recognize nested tags with regexps, don't. Use a real (X|HT)?ML parser. You really don't want the pony to come.

msw
@msw: Sorry, I forgot to mention I use C#. See http://blogs.msdn.com/b/bclteam/archive/2004/11/12/256783.aspx?wa=wsignin1.0 for exactly what the compiled option does in .NET. Russ Cox's article is very interesting. I never had the time to fully study it, but I was aware of some of the key points, and I understand that non-regularity is the reason the regex is slow. My question could be rephrased as: "Is there a way to get the same result using a regular expression that is actually regular?" Can you think of a regex that would produce the same result without all this backtracking?
Sylverdrag
PS: I assume it's a typo, but according to Russ's article (and the definition), DFAs are more efficient than NFAs (a single choice instead of multiple choices).
Sylverdrag
There are multiple efficiencies; I added a word above to clarify; thanks. I'm plowing through the Cox paper now.
msw
+2  A: 

I would prepend a lookahead assertion to the regex to make sure that you can in fact match a \r\n at the current position. Otherwise, the engine has to do the entire lookbehind (arbitrary-size, to boot) on every single character in the entire file, only to find out then that there is no carriage return to be replaced.

(?=\r\n)(?(?<=<SOURCE>(?:(?!</?SOURCE>).)*)(\r\n))

should be a lot faster. At least in RegexBuddy, the regex engine needs far fewer steps to complete the match; if it doesn't help in .NET, I don't know why. Perhaps the conditional regex is not as efficient (I must admit that at first I didn't recognize it and thought there was a syntax error in your regex). I don't think you need a conditional regex in this scenario. How about

\r\n(?<=<SOURCE>(?:(?!</?SOURCE>).)*)

Is this faster? I'm assuming you're using RegexOptions.Singleline to compile the regex.

If not, well, then there are probably very many carriage returns and many other characters inside your <SOURCE> blocks, and the arbitrary-size lookbehind simply takes a long time. In that case, my other suggestion is to split up the task:

  1. Match the <SOURCE> block
  2. Replace all CRLFs inside the block (no regex required)
  3. Substitute the modified block back into the file
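
As a sanity check, here is a minimal Python sketch of this three-step approach (the tag name and sample text are taken from the question; each CRLF inside the block is replaced with a space here, but substitute an empty string if you want them deleted outright):

```python
import re

text = ("<TU>Lorem \r\nIpsum</TU>\r\n"
        "<SOURCE>This is a sentence\r\nthat should not contain\r\n"
        "any line break.\r\n</SOURCE>")

# Step 1: match each <SOURCE>...</SOURCE> block; DOTALL lets '.' cross line breaks.
block = re.compile(r"<SOURCE>.*?</SOURCE>", re.DOTALL)

# Steps 2 and 3: rewrite the CRLFs inside the matched block, then substitute
# the modified block back, all in one pass.
def join_lines(m):
    return m.group(0).replace("\r\n", " ")

result = block.sub(join_lines, text)
print(result)
```

Because the inner replacement is plain string work, only the block-matching step pays any regex cost at all.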
Tim Pietzcker
Thanks for the suggestion. (+1) It makes a lot of sense and I had high hopes, but I timed both with a stopwatch and the same text sample in Expresso; they both finished in exactly 21 seconds. For unknown reasons, the same operation on the same file takes 8 seconds in my program, again regardless of whether I test it with the look-ahead or without. I assume this means that .NET already optimizes the regex before running it.
Sylverdrag
RE: Parser. The file format is not known in advance. My software's main specialty is preparing text-based files which do not follow the rules, or for which no known rules exist. I used XML-like tags for this example because it makes it very easy to understand what I want to do, but it could be just about anything. I could write a parser to handle this specific function, but there are many constructs of this type, and I can't write a parser for every one of them and release a new version any time someone requests a new snippet. As much as possible, I want to stick with regex.
Sylverdrag
Thanks for your update. \r\n(?<=<SOURCE>(?:(?!</?SOURCE>).)*) is a little bit better (18 seconds instead of 21), but I went with Alan's answer, as it is much faster. The non-backtracking subexpression (?>…) is a really neat trick.
Sylverdrag
+2  A: 

Try this:

\r\n(?=(?>[^<>]*(?><(?!/?SOURCE>)[^<>]*)*)</SOURCE>)

It starts out by matching \r\n, then uses a lookahead to see if the match is between <SOURCE> and </SOURCE>. It does that by looking for a </SOURCE>, but if it finds <SOURCE> first it fails. Atomic groups prevent it from saving the state information that would be needed for backtracking, because pass or fail, backtracking is never necessary.
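
Outside of .NET, the pattern is easy to sanity-check in Python; note that the stdlib `re` module only gained atomic groups in Python 3.11, so this sketch keeps the same structure but drops the `(?>…)` wrappers (the matches are identical; only the no-backtracking guarantee is lost):

```python
import re

# Same shape as the pattern above, minus the atomic groups
# (Python's stdlib re supports (?>...) only from 3.11 on).
pattern = re.compile(r"\r\n(?=[^<>]*(?:<(?!/?SOURCE>)[^<>]*)*</SOURCE>)")

text = ("<TU>Lorem \r\nIpsum</TU>\r\n"
        "<SOURCE>This is a sentence\r\nthat should not contain\r\n"
        "any line break.\r\n</SOURCE>")

# Replace each CRLF inside <SOURCE>...</SOURCE> with a space;
# CRLFs elsewhere (e.g. inside <TU>) are left alone, because the
# lookahead fails as soon as it hits a closing tag other than </SOURCE>.
result = pattern.sub(" ", text)
print(result)
```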

Alan Moore
Thanks a lot! Cuts down the time by half in Expresso and shaves about 2 seconds in my app!
Sylverdrag