After reading this article http://www.codinghorror.com/blog/archives/000228.html I understand the benefits of compiled regular expressions a little better. However, in what scenarios would you personally consider a compiled regex mandatory?

For instance, I am using a regex in a loop, and the regular expression string uses different variables on each iteration, so I would see no improvement by flagging this regex as compiled, right?


Hi, thanks for your answers. My actual code is not straightforward and is composed of an RE built on the fly, so I cannot include it. For all intents and purposes, here is an example which demonstrates my approach:

foreach (field field in fields.Where(x => x.condition))
    MatchResults = Regex.Match(request.Message, field.RegularExpression);
...
+2  A: 

I would compile the RE when it has to be used more than two or three times and the cost of compiling is more than offset by the improvements in execution time of the result.

I never compile one-off REs and I always compile those that are executed more than five times (give or take a couple) but I've never found a need for parameterized REs (that need may exist, it's just I've never found it) so that doesn't come into it.

EDIT: That article you refer to states that up-front compiling is an order of magnitude (ten times) slower than interpretation, yet only saves 30% at match time. In addition, interpreted REs are cached anyway. So I would say it's definitely arguing against the casual use of compiling.

A 30% saving means it would take 100/3 (about 33) executions of the compiled RE to recover the initial cost of compilation. That's according to the MSDN doco on .NET. I've always assumed that in my REs (Python/Perl/Java) it wouldn't be that bad, but I guess I should check.
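As a rough sketch of where the 33 comes from: write t for the time of one interpreted match, and assume (reading the article's rounded figures literally, not as measurements) that compiling costs 10t up front and each compiled match saves 0.3t. Compilation pays off once the accumulated savings cover the up-front cost:

```latex
\[
N \cdot 0.3\,t \;\ge\; 10\,t
\quad\Longrightarrow\quad
N \;\ge\; \frac{10}{0.3} \approx 33.3
\]
```

So under those assumptions the compiled RE only comes out ahead from about the 34th execution onward. The real break-even point depends on the pattern and the input, so treat this as an estimate only.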

paxdiablo
My problem is, my RegEx is potentially executed hundreds of times, but the match string is constantly slightly altered.
GONeale
If by match string you mean the input to the RE, compilation is still better. If you mean a change to the RE itself, you're effectively compiling the RE every time anyway, so don't bother.
paxdiablo
Yep that's right, changing the RE itself. That's what I thought, I shouldn't bother.
GONeale
See edit: I definitely wouldn't with an order-of-magnitude cost.
paxdiablo
Very interesting - Especially how it would take around 33 executions to recoup the costs.
GONeale
100 units of time to run the interpreted RE. An order of magnitude slower for compiling is 1000 units. The compiled RE is 30% faster, so it takes 30 units less each time than the interpreted one. To recover those 1000 units you need about 33 runs (33 * 30 = 990 units saved), so you break even on roughly the 34th.
paxdiablo
Sorry, @GO, I thought that last comment of yours was a question ("how?") rather than a statement ("especially how..."). But I'll leave my last comment in anyway to explain my rough-as-guts calculation.
paxdiablo
A: 

It sounds to me like you're being too specific with your expression. I'd be interested to see a code example of what you're actually trying to parse, because my gut tells me your approach may not be generic enough. If that's not the case, a set of expressions could also be pre-compiled and each compared during the loop, for example.

Please edit your question and add some code so we can help you further.

Soviut
A: 

Compiling a regex should only be undertaken when the regex is sufficiently complex. Simple regexes will execute more efficiently uncompiled, as the time to compile adds unnecessary overhead. If your regex is highly complex but is only used once, then you should evaluate whether or not it will benefit from compilation. You can measure this by setting up a routine that times the two alternatives.

In almost every case where the regex statement is used multiple times it is worth compiling the regex outside the loop.
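A timing routine of that kind could look roughly like the sketch below. It is only an illustration: the class name RegexTiming, the method Measure, the sample pattern and the sample input are all made up for this example. Construction time is deliberately included in the measurement, since that is where the cost of RegexOptions.Compiled shows up.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class RegexTiming
{
    // Times construction of the Regex plus `iterations` calls to IsMatch.
    // Construction is inside the timed region on purpose.
    public static TimeSpan Measure(string pattern, RegexOptions options,
                                   string input, int iterations)
    {
        Stopwatch watch = Stopwatch.StartNew();
        Regex regex = new Regex(pattern, options);
        for (int i = 0; i < iterations; i++)
        {
            regex.IsMatch(input);
        }
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Main()
    {
        // Hypothetical pattern and input; substitute your own.
        const string pattern = @"\b\d{3}-\d{4}\b";
        const string input = "call 555-1234 before noon";

        // Try a few run counts to see where (or whether) compiling pays off.
        foreach (int n in new[] { 1, 100, 100000 })
        {
            TimeSpan interpreted = Measure(pattern, RegexOptions.None, input, n);
            TimeSpan compiled = Measure(pattern, RegexOptions.Compiled, input, n);
            Console.WriteLine("{0,7} runs: interpreted {1} ms, compiled {2} ms",
                n, interpreted.TotalMilliseconds, compiled.TotalMilliseconds);
        }
    }
}
```

The numbers vary with the runtime, the pattern and the input, so run it against your real expressions and data rather than trusting a single sample.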

nullnvoid
+3  A: 

In .NET, there are two ways to "compile" a regular expression. Regular expressions are always "compiled" before they can be used to find matches. When you instantiate the Regex class without the RegexOptions.Compiled flag, your regular expression is still converted into an internal data structure used by the Regex class. The actual matching process runs on that data structure rather than on the string representing your regex. It persists as long as your Regex instance lives.

Explicitly instantiating the Regex class is preferable to calling the static Regex methods if you're using the same regex more than once. The reason is that the static methods create a Regex instance anyway, and then throw it away. They do keep a cache of recently compiled regexes, but the cache is rather small, and the cache lookup is far more costly than simply referencing a pointer to an existing Regex instance.
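Applied to a loop like the one in the question, one way to hold on to Regex instances is to keep one per distinct pattern string. The sketch below is a guess at how that could look, not the asker's real code: the Field class, the RegexCacheDemo class, the GetRegex helper and the sample data are all hypothetical. It only helps when the same pattern strings come around again; if every pattern is new, each one still has to be parsed once.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the asker's field type.
class Field
{
    public bool Condition;
    public string RegularExpression;
}

static class RegexCacheDemo
{
    // One Regex instance per distinct pattern string, built on first use.
    // Not thread-safe as written; add locking if several threads share it.
    static readonly Dictionary<string, Regex> Cache = new Dictionary<string, Regex>();

    public static Regex GetRegex(string pattern)
    {
        Regex regex;
        if (!Cache.TryGetValue(pattern, out regex))
        {
            regex = new Regex(pattern);   // no RegexOptions.Compiled here
            Cache[pattern] = regex;
        }
        return regex;
    }

    public static void Main()
    {
        var fields = new List<Field>
        {
            new Field { Condition = true,  RegularExpression = @"id=(\d+)" },
            new Field { Condition = false, RegularExpression = @"never used" },
            new Field { Condition = true,  RegularExpression = @"user=(\w+)" },
        };
        string message = "id=42;user=alice";

        foreach (Field field in fields)
        {
            if (!field.Condition) continue;
            // Reuses the stored instance instead of the static Regex.Match.
            Match match = GetRegex(field.RegularExpression).Match(message);
            Console.WriteLine("{0} -> {1}", field.RegularExpression,
                match.Success ? match.Groups[1].Value : "(no match)");
        }
    }
}
```

Whether a dictionary like this beats the built-in cache for a given workload is something to measure rather than assume.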

The above form of compilation exists in every programming language or library that uses regular expressions, though not all offer control over it.

The .NET framework provides a second way of compiling regular expressions by constructing a Regex object and specifying the RegexOptions.Compiled flag. Absence or presence of this flag does not indicate whether or not the regex is compiled. It indicates whether the regex is compiled quickly, as described above, or thoroughly, as described below.

What RegexOptions.Compiled really does is to create a new assembly with your regular expression compiled down to MSIL. This assembly is then loaded, compiled to machine code, and becomes a permanent part of your application (while it runs). This process takes a lot of CPU ticks, and the memory usage is permanent.

You should use RegexOptions.Compiled only if you're processing so much data with it that the user actually has to wait on your regex. If you can't measure the speed difference with a stopwatch, don't bother with RegexOptions.Compiled.

Jan Goyvaerts
Quite comprehensive.
GONeale