My software allows users to use regular expressions to prepare files. I am in the process of adding a default regex library with common expressions that can be re-used to prepare a variety of formats. One common task is to remove CRLFs in specific parts of a file, but not in others. For instance, this:

    <TU>Lorem 
    Ipsum</TU>
    <SOURCE>This is a sentence
    that should not contain
    any line break.
    </SOURCE>

Should become:

    <TU>Lorem 
    Ipsum</TU>
    <SOURCE>This is a sentence that should not contain any line break.
    </SOURCE>

I have a regex that does the job pretty nicely:

(?(?<=<SOURCE>(?:(?!</?SOURCE>).)*)(\r\n))

The problem is that it is processing-intensive: with files above 500 KB, it can take 30+ seconds. (The regex is compiled in this case; uncompiled, it is much slower.)

It's not a big issue, but I wonder if there is a better way to achieve the same results with regex.

Thanks in advance for your suggestions.

+2  A: 

"Compiling" a regular expression simply means converting it from its textual form into a Non-deterministic Finite Automaton. It isn't "compiling into machine code" as you might expect.

NFAs are typically smaller than their corresponding DFAs and can execute more space-efficiently. Every DFA has at least one equivalent NFA and vice versa. However, Perl-Compatible Regular Expressions aren't actually regular, so I don't know whether they are converted into NFAs, or whether "compiling" is just another form of lexical analysis that, once done, need not be done again.

According to Russ Cox, PCREs are slow partly because of their non-regularity, and your expression above is quite non-regular.

Oh, and if you are trying to recognize nested tags with regexps, don't. Use a real (X|HT)?ML parser. You really don't want the pony to come.

msw
@msw: Sorry, I forgot to mention I use C#. See http://blogs.msdn.com/b/bclteam/archive/2004/11/12/256783.aspx?wa=wsignin1.0 for exactly what the compiled option does in .NET. Russ Cox's article is very interesting. I never had the time to fully study it, but I was aware of some of the key points, and I understand that non-regularity is the reason the regex is slow. My question could be rephrased as: "Is there a way to get the same result using a regular expression that is actually regular?" Can you think of a regex that would produce the same result without all this backtracking?
Sylverdrag
PS: I assume it's a typo, but according to Russ's article (and the definition), DFAs are more efficient than NFAs (a single choice instead of multiple choices).
Sylverdrag
There are multiple efficiencies; I added a word above to clarify; thanks. I'm plowing through the Cox paper now.
msw
+2  A: 

I would prepend a lookahead assertion to the regex to make sure that you can in fact match a \r\n at the current position. Otherwise, the engine has to do the entire lookbehind (arbitrary-size, to boot) on every single character in the entire file, only to find out then that there is no carriage return to be replaced.

(?=\r\n)(?(?<=<SOURCE>(?:(?!</?SOURCE>).)*)(\r\n))

should be a lot faster. At least in RegexBuddy, the regex engine needs far fewer steps to complete the match; if it doesn't help in .NET, I don't know why. Perhaps the conditional regex is not as efficient (I must admit that at first I didn't recognize it and thought there was a syntax error in your regex). I don't think you need a conditional regex in this scenario. How about

\r\n(?<=<SOURCE>(?:(?!</?SOURCE>).)*)

Is this faster? I'm assuming you're using RegexOptions.Singleline to compile the regex.

If not, well, then there are probably very many carriage returns and many other characters inside your <SOURCE> blocks, and the arbitrary-size lookbehind simply takes a long time. In that case, my other suggestion is to split up the task:

  1. Match the <SOURCE> block
  2. Replace all CRLFs inside the block (no regex required)
  3. Substitute the modified block back into the file
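
As a sanity check, here is a minimal Python sketch of this three-step approach (the tag name and sample text are taken from the question; each CRLF inside the block is replaced with a space here, but substitute an empty string if you want them deleted outright):

```python
import re

text = ("<TU>Lorem \r\nIpsum</TU>\r\n"
        "<SOURCE>This is a sentence\r\nthat should not contain\r\n"
        "any line break.\r\n</SOURCE>")

# Step 1: match each <SOURCE>...</SOURCE> block; DOTALL lets '.' cross line breaks.
block = re.compile(r"<SOURCE>.*?</SOURCE>", re.DOTALL)

# Steps 2 and 3: rewrite the CRLFs inside the matched block, then substitute
# the modified block back, all in one pass.
def join_lines(m):
    return m.group(0).replace("\r\n", " ")

result = block.sub(join_lines, text)
print(result)
```

Because the inner replacement is plain string work, only the block-matching step pays any regex cost at all.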
Tim Pietzcker
Thanks for the suggestion. (+1) It makes a lot of sense and I had high hopes, but I timed both with a stopwatch and the same text sample in Expresso; they both finished in exactly 21 seconds. For unknown reasons, the same operation on the same file takes 8 seconds in my program, again regardless of whether I test it with the look-ahead or without. I assume this means that .NET already optimizes the regex before running it.
Sylverdrag
RE: Parser. The file format is not known in advance. My software's main specialty is preparing text-based files which do not follow the rules, or for which no known rules exist. I used XML-like tags for this example because it makes it very easy to understand what I want to do, but it could be just about anything. I could write a parser to handle this specific function, but there are many constructs of this type, and I can't write a parser for every one of them and release a new version any time someone requests a new snippet. As much as possible, I want to stick with regex.
Sylverdrag
Thanks for your update. \r\n(?<=<SOURCE>(?:(?!</?SOURCE>).)*) is a little bit better (18 seconds instead of 21), but I went with Alan's answer, as it is much faster. The non-backtracking subexpression (?>…) is a really neat trick.
Sylverdrag
+2  A: 

Try this:

\r\n(?=(?>[^<>]*(?><(?!/?SOURCE>)[^<>]*)*)</SOURCE>)

It starts out by matching \r\n, then uses a lookahead to see if the match is between <SOURCE> and </SOURCE>. It does that by looking for a </SOURCE>, but if it finds <SOURCE> first it fails. Atomic groups prevent it from saving the state information that would be needed for backtracking, because pass or fail, backtracking is never necessary.
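
Outside of .NET, the pattern is easy to sanity-check in Python; note that the stdlib `re` module only gained atomic groups in Python 3.11, so this sketch keeps the same structure but drops the `(?>…)` wrappers (the matches are identical; only the no-backtracking guarantee is lost):

```python
import re

# Same shape as the pattern above, minus the atomic groups
# (Python's stdlib re supports (?>...) only from 3.11 on).
pattern = re.compile(r"\r\n(?=[^<>]*(?:<(?!/?SOURCE>)[^<>]*)*</SOURCE>)")

text = ("<TU>Lorem \r\nIpsum</TU>\r\n"
        "<SOURCE>This is a sentence\r\nthat should not contain\r\n"
        "any line break.\r\n</SOURCE>")

# Replace each CRLF inside <SOURCE>...</SOURCE> with a space;
# CRLFs elsewhere (e.g. inside <TU>) are left alone, because the
# lookahead fails as soon as it hits a closing tag other than </SOURCE>.
result = pattern.sub(" ", text)
print(result)
```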

Alan Moore
Thanks a lot! Cuts down the time by half in Expresso and shaves about 2 seconds in my app!
Sylverdrag