ansaurus

Question

Confusion in RegExp Reluctant quantifier? Java

Answer 1

+4 A:

*? matches zero or more matches, but as few as possible (and by the way, that’s usually called “non-greedy”, not “reluctant”). So if zero matches is possible, that’s the optimal match.

What exactly do you want to achieve? Maybe non-greedy matching isn’t what you need.

Konrad Rudolph 2010-05-14 11:23:03

Java does call it "reluctant". http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html

polygenelubricants 2010-05-14 11:47:25

I think "reluctant" is the best word for them, although "non-greedy" is good too. But "lazy" (the other most common descriptor) is misleading IMO; they actually work harder than either greedy or possessive quantifiers.

Alan Moore 2010-05-14 11:57:34

@Alan Moore: I agree, “lazy” usually is something else in CS (see lazy evaluation).

Konrad Rudolph 2010-05-14 12:06:54

Answer 2

+3 A:

In addition to Konrad Rudolph's answer:

abc*?

matches "ab" in any case and "c" only if it must. Since nothing follows the *?, the regex engine stops immediately. If you had:

abc*?f

then it would match "abcf" be cause the "c" must match in order to allow the "f" to match, too. The other expression:

.*?

matches nothing because this pattern is 100% optional.

.*?f

would match "abcf" again.

Tomalak 2010-05-14 11:38:02

Answer 3

+1 A:

It never makes sense to have a reluctant quantifier as the last thing in a regex. A reluctant quantifier matches only as much as it has to in order to achieve an overall match. That means there has to be something after the quantifier to force it to keep matching.

If it seems odd to have something that can be put such a pointless use, it's probably because reluctant quantifiers are an add-on--something that's not possible with "real" regular expressions. Some other examples of pointless usage are the "quantifier" {1}, and \b+ or any other zero-width assertion (^, $, lookarounds, etc.) with a quantifier. Some flavors treat the latter as a syntax error; Java allows it, but of course only applies the assertion once.

Alan Moore 2010-05-14 11:53:13

Answer 4

+1 A:

The ? reluctant quantifier makes .* match as few characters as possible, only matching more character if it's required by backtracking.

Here's an illustrative example of using regex to find a non-empty prefix that is also a suffix of a string (no overlapping).

The capturing group \1 in the first pattern is greedy: it first matches everything, and takes as less as it backtracks. As such, the pattern will find the longest possible prefix/suffix match:

    System.out.println(
        "abracadabra".replaceAll("^(.+).*\\1$", "($1)")
    ); // prints "(abra)"

Now \1 in the second pattern is reluctant; it first matches nothing, and takes more as it backtracks. As such, the pattern will find the shortest prefix/suffix match:

    System.out.println(
        "abracadabra".replaceAll("^(.+?).*\\1$", "($1)")
    ); // prints "(a)"

In your case, the .*? can match an empty string, and never needed to backtrack and match more since it was enough for the overall pattern to match.

ansaurus

tags:

views:

answers:

Confusion in RegExp Reluctant quantifier? Java

See also

related questions