tags:

views:

142

answers:

5
+3  Q: 

Regex comparison

I'm (finally) starting to learn regex, and I'm wondering if there's any notable difference between these two pattern strings. I'm trying to match lines such as "Title=Blah", and match "Title" and "Blah" in two groups.

The problem comes with titles like "Title=The = operator". Here are the two choices to solve the problem:

^([^=]+)=(.+)$
^(.+?)=(.+)$

Is there any difference between the two, either performance-wise or functionality-wise?

A: 

Run both against '==test'

Huh? Why was this response worthy of even one vote?
Jim G.
+5  A: 

The first one requires that there be at least one non-= character before an = in order to match, where the second doesn't; it'll match on a leading ==.

As to performance, I don't anticipate a meaningful difference, but if you truly care, the only thing to do is profile it. Which I would do by writing a pair of scripts, each running one of the methods a few hundred thousand times, and timing them using the Unix time command.

chaos
Yes, measuring and profiling is always best; but academically there's a much tighter answer.First of all, depending on the content, a lazy quantifier can make a big difference. If the second regex ran repeatedly against moderately sized HTML pages, then it would be noticeably slower.
Jim G.
And as Stephen C pointed out, this will matter for most regex engines, which are DFAs, whereas it will not matter for NDFA regex engines which comprise a much smaller subset.
Jim G.
+1  A: 

On the performance side, it will (in theory) depend on which implementation of regex you are using. While it is probably not the case here, there can profound differences between implementations for problematic regexes. For example, the regex a?a?a?aaa applied to a string consisting of N "a"s has complexity of O(N**3) using typical (i.e. DFA-based) regex engines.

For a more information refer to: "Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)" by Russ Cox.

Stephen C
+1  A: 

A great question, but sadly it will depend on the regex engine. You'll have to profile it to know the difference in runtime. Well, I suppose if you had the engine's source code then you could make the determination but I'm assuming this isn't the case.

Dinah
There's no need to parse through a regex engine's source code. That would be far too laborious.Regex engines tend to be one of several well-known flavors. You can check out Jeffrey Friedl's book for guidance. That would be far more productive than parsing through a regex engine's source code (which would be complicated to say the least).
Jim G.
Agreed. I was just making the point that the question was about the minutia that may be engine dependent. Although there are some constants, some of the finer points (those that would vary by engine) could be solved by examining said engine.
Dinah
+2  A: 

The first one requires that there be at least one non-= character before an = in order to match, whereas the second doesn't; it'll match on a leading ==.

Depending on your content, the first one could run significantly faster. Here's why:

An Alternative to Laziness
In this case, there is a better option than making the plus lazy. We can use a greedy plus and a negated character class: <[^=]+>. The reason why this is better is because of the backtracking. When using the lazy plus, the engine has to backtrack for each character in the HTML tag that it is trying to match. When using the negated character class, no backtracking occurs at all when the string contains valid HTML code. Backtracking slows down the regex engine. You will not notice the difference when doing a single search in a text editor. But you will save plenty of CPU cycles when using such a regex repeatedly in a tight loop in a script that you are writing...

Jim G.
@Jim: If the engine is based on NDFAs then it can decide whether to keep expanding the `.+?` based on single character lookahead; i.e. no backtracking. Admittedly, most regex engines >>are<< based on DFAs.
Stephen C
Stephen: You're exactly right, and your point is germane. For brevity's sake, I omitted that detail in my post because, as you said, most regex engines are DFAs. Interestingly enough, the page referenced in the hypertext mentions this point as well.
Jim G.