views:

666

answers:

6

I have always written regexes like this

<A HREF="([^"]*)" TARGET="_blank">([^<]*)</A>

but I just learned about this lazy thing and that I can write it like this

<A HREF="(.*?)" TARGET="_blank">(.*?)</A>

is there any disadvantage to using this second approach? The regex is definitely more compact (even SO parses it better).

Edit: There are two best answers here, which point out two important differences between the expressions. ysth's answer points to a weakness in the non-greedy/lazy one, in which the hyperlink itself could possibly include other attributes of the A tag (definitely not good). Rob Kennedy points out a weakness in the greedy example, in that anchor texts cannot include other tags (definitely not okay, because it wouldn't grab all the anchor text either)... so the answer is that, regular expressions being what they are, lazy and non-lazy solutions that seem the same are probably not semantically equivalent.

Edit: Third best answer is by Alan M about relative speed of the expressions. For the time being, I'll mark his as best answer so people give him more points :)

+1  A: 

“lazy” is the wrong word here. You mean non-greedy as opposed to greedy. There's no disadvantage in using it, that I know of. But in your special case, neither should it be more efficient.

Konrad Rudolph
Thanks for your answer. These guys http://www.regular-expressions.info/repeat.html refer to lazy or greedy, which I admit makes less sense than greedy and non-greedy.
Yar
It might interest you to know that "those guys" is actually Jan Goyvaerts, an SO member. ;)
Alan Moore
Yeah, I really can't complain about the quality of SO memeber's. The last time I used a technical forum with this level of responses was the xSLT forum, and a famous guru named David Carlile (something like that) answered most of the questions personally.
Yar
+3  A: 

It's not about better or worse. The term I've seen the most is greedy vs. non-greedy, but however you put they do two different things. You want to use the right one for the task. I.e. turn off the greedy option when you don't want to capture multiple matches in a line.

Brian Rasmussen
+1  A: 

Non-greedy is better, is it not? It works forward, checking for a match each time and stopping when it finds one, whereas the normal kleene closure (*) works backwards matching the rest of the input and removing things until it finds a match.

In the end, they do different things, but I think non-greedy outperforms greedy. Bear in mind that I haven't tested this, but now I'm curious.

eplawless
bet it's implementation dependent. Thanks for your answer!
Yar
+6  A: 

Note that your examples are not equivalent. Your first regular expression will not select any links that contain other tags, such as img or b. The second regular expression will, and I expect that's probably what you wanted anyway.

Besides the difference in meaning, the only disadvantage I can think of is that support for non-greedy modifiers isn't quite as prevalent as character-class negation is. It's more widely supported than I thought, before I checked, but notably absent from the list is GNU Grep. If the regular-expression evaluators you're using support it, then go ahead and use it.

Rob Kennedy
Hi Rob, it's true, I do want ANYTHING that can go between the A tags. Whether my regex evaluator supports it... wow, I didn't even know that it couldn't. I'll have to check (I'm in AS3) and I'll edit the question with that.
Yar
I'm changing the best answer to the other best answer (ysth)... you are both right on why the two expressions are different. But you already got 5 points (for mentioning GNU?), so I'll give the best answer to ysth...
Yar
+5  A: 

The complemented character class more rigorously defines what you want to match, so whenever you can, I'd use it.

The non greedy regex will match things you probably don't want, such as:

<A HREF="foo" NAME="foo" TARGET="_blank">foo</A>

where your first .*? matches

foo" NAME="foo
ysth
I don't get your last remark. In your opinion, what would be matched here and why would it be different from what we want?
Konrad Rudolph
Doesn't the first .*? match as few characters as possible before matching the double quote, thus only matching foo?
Kenny
ysth: I now see your point, i.e. that the arguments are reordered.
Konrad Rudolph
It's hard to say what I would want to match in that case because it's not legal HTML (or at least makes no sense to me).
Yar
Kenny: no, .*? will first try matching through the first double quote, but if that doesn't permit a successful match, it will go on to the second double quote, etc.
ysth
@Daniel Rosenstark, I'm changing the example to be legal.
ysth
Ah I had the same trouble as Konrad, thanks for clearing that up ysth.
Kenny
Wow! I had not thought about that. You're totally right... if it's not precisely the regex that I'm looking for, it gets in the match. For my current case, I'm okay because the code looks JUST like I need, but in general I now see the difference. Problem is I can't mark two best answers...
Yar
+7  A: 

Another thing to consider is how long the target text is, and how much of it is going to be matched by the quantified subexpression. For example, if you were trying to match the whole <BODY> element in a large HTML document, you might be tempted to use this regex:

/<BODY>.*?<\/BODY>/is

But that's going to do a whole lot of unnecessary work, matching one character at a time while effectively doing a negative lookahead before each one. You know the </BODY> tag is going to be very near the end of the document, so the smart thing to do is to use a normal greedy quantitier; let it slurp up the whole rest of the document and then backtrack the few characters necessary to match the end tag.

In most cases you won't notice any speed difference between greedy and reluctant quantifiers, but it's something to keep in mind. The main reason why you should be judicious in your use of reluctant quantifiers is the one that was pointed out by the others: they may do it reluctantly, but they will match more than you want them to if that's what it takes to achieve an overall match.

Alan Moore