tags:

views:

114

answers:

4

Hi, Could anyone please tell me the reason of getting an output as: ab for the following RegExp code using Relcutant quantifier?

    Pattern p = Pattern.compile("abc*?");
    Matcher m = p.matcher("abcfoo");
while(m.find())
   System.out.println(m.group()); // ab

and getting empty indices for the following code?

   Pattern p = Pattern.compile(".*?");
   Matcher m = p.matcher("abcfoo");
   while(m.find())
   System.out.println(m.group());
+4  A: 

*? matches zero or more matches, but as few as possible (and by the way, that’s usually called “non-greedy”, not “reluctant”). So if zero matches is possible, that’s the optimal match.

What exactly do you want to achieve? Maybe non-greedy matching isn’t what you need.

Konrad Rudolph
Java does call it "reluctant". http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html
polygenelubricants
I think "reluctant" is the best word for them, although "non-greedy" is good too. But "lazy" (the other most common descriptor) is misleading IMO; they actually work harder than either greedy or possessive quantifiers.
Alan Moore
@Alan Moore: I agree, “lazy” usually is something else in CS (see lazy evaluation).
Konrad Rudolph
+3  A: 

In addition to Konrad Rudolph's answer:

abc*?

matches "ab" in any case and "c" only if it must. Since nothing follows the *?, the regex engine stops immediately. If you had:

abc*?f

then it would match "abcf" be cause the "c" must match in order to allow the "f" to match, too. The other expression:

.*?

matches nothing because this pattern is 100% optional.

.*?f

would match "abcf" again.

Tomalak
+1  A: 

It never makes sense to have a reluctant quantifier as the last thing in a regex. A reluctant quantifier matches only as much as it has to in order to achieve an overall match. That means there has to be something after the quantifier to force it to keep matching.

If it seems odd to have something that can be put such a pointless use, it's probably because reluctant quantifiers are an add-on--something that's not possible with "real" regular expressions. Some other examples of pointless usage are the "quantifier" {1}, and \b+ or any other zero-width assertion (^, $, lookarounds, etc.) with a quantifier. Some flavors treat the latter as a syntax error; Java allows it, but of course only applies the assertion once.

Alan Moore
+1  A: 

The ? reluctant quantifier makes .* match as few characters as possible, only matching more character if it's required by backtracking.

Here's an illustrative example of using regex to find a non-empty prefix that is also a suffix of a string (no overlapping).

The capturing group \1 in the first pattern is greedy: it first matches everything, and takes as less as it backtracks. As such, the pattern will find the longest possible prefix/suffix match:

    System.out.println(
        "abracadabra".replaceAll("^(.+).*\\1$", "($1)")
    ); // prints "(abra)"

Now \1 in the second pattern is reluctant; it first matches nothing, and takes more as it backtracks. As such, the pattern will find the shortest prefix/suffix match:

    System.out.println(
        "abracadabra".replaceAll("^(.+?).*\\1$", "($1)")
    ); // prints "(a)"

In your case, the .*? can match an empty string, and never needed to backtrack and match more since it was enough for the overall pattern to match.

See also


Here's another illustrative example of reluctant quantifier on a finite repetition:

Here, x{3,5} is greedy, and will take as much as possible.

    System.out.println(
        "xxxxxxx".replaceAll("x{3,5}", "Y")
    ); // prints "Yxx"

Here, x{3,5}? is reluctant, and will take as few as possible.

    System.out.println(
        "xxxxxxx".replaceAll("x{3,5}?", "Y")
    ); // prints "YYx"
polygenelubricants