views:

277

answers:

5

It seems that this is a huge source of confusion for beginners writing regular expressions, can cause hidden performance problems, and it would seem that a typical use case would be non-greedy.

Is this just for legacy reasons (it was how it was first done, and every implementation copies that), or is there a reason for it?

+1  A: 

Possible reason: The regex engine needs to backtrack a lot if it's non-greedy.

KennyTM
@roe: Yes, both quantifier behaviors can require backtracking.
Gumbo
+4  A: 

Well, it is important that computers behave predictably whenever possible. So the correct behavior should follow a simple rule, like greedy matching, so that at least experienced programmers can predict the outcome of a piece of code.

As for whether a typical use case should be non-greedy, what about the following: suppose I have a file with entries like foo1909, bar3939, baz3331, and I just want to extract these numbers. It seems natural enough to write (\d*) as the regular expression for this.

You might say that it is just as easy to write (\d*)\D or whatever, but it is basically always the case that the programmer can be more explicit and less ambiguous. Since we wanted a default behavior that was 100% predictable, and trivial to calculate in ones head, it seems reasonable to me.

forefinger
This is a perfectly logical and reasonable guess, however, it's quite unrelated to the real reason, which is simply that non-greedy came much, much, later and so it was not the default.
DigitalRoss
+2  A: 

In the case of performance, lazy quantifiers aren't always faster because of backtracking: http://blog.stevenlevithan.com/archives/greedy-lazy-performance

As for the actual design, I honestly can't say why quantifiers are greedy by default but I do wonder what control character would have been used to make a quantifier greedy instead of lazy. I don't think ? would have cut it :-)

Andy E
Obviously it would be $
forefinger
@forefinger: Doesn't that match the end of the string/line?
Andy E
It does. It just seems like the best greedy symbol.
forefinger
+6  A: 

Hysterical Raisens


Part of the answer may involve the origins of REs in practical computing. They were originally a theoretical concept from automata theory and formal language theory until Ken Thompson himself wrote a real implementation and used them in qed and ed(1).

The original version had only the greedy syntax and so there wasn't a decision to even make.

DigitalRoss
I'm not so sure you can say that theoretical regular languages are greedy by default. I think a Kleene regular expression defines a set of all the strings that could match it, so `/x*/` could match "" or "x" or "xxx" (etc.). Such an expression defines a regular language including the strings "", "x" and "xxx". Note that this says nothing about how to go about searching for matches in a body of text; it's only when you apply the theory that you start caring about greediness.
Nate C-K
Sure, sure, by *"original version"* I just meant *"as Ken Thompson typed it in for those editors",* and in those versions and for almost a decade afterwards, ed, grep, ex, and vi only did greedy pattern matching.
DigitalRoss
Ah, in that case we're in agreement.
Nate C-K
+1  A: 

The real issue here is the Kleene closure operator (star); for everything else in a regular expression, the longest match is the same as the shortest match.

When you think about it in those terms, you realize that more modern tools realize you need both. I'm up late so I can think of only two examples:

  • Both ksh and bash provide "longest match" and "shortest match" forms of most of the special variable-altering operators.

  • The Lua regular expressions include * for Kleene closure longest match and - for Kleene closure shortest match. This one always bites me when I forget to escape a literal - sign.

It would be interesting to go back to Kleene's original work and see if that might have influenced early tools toward longest match.

Norman Ramsey
What about alternation? Given the regex `/foo|foobar/` and the target string `blahfoobarblah`, a mathematically-pure regular expression will always match `foobar`, while a Perl-derived, NFA regex will settle for `foo`.
Alan Moore