tags:

views:

215

answers:

5

How would the '.+?' regular expression work? Is the .+ part matching anything written, and the ? part saying it can either be there or not? So, for example, this regular expression would match:

'cat'
'' (ie, nothing written, just the empty string)

A: 

I suggest you looking at http://www.regular-expressions.info/

seize
Why was this down voted?
rahul
If this had an explanation in addition to the link I could upvote it.
tvanfosson
If you downvote me for being lazy, downvote who made the question for being more lazy then...
seize
actually, i'm not lazy; i searched google before i posted on here and couldn't find a clear result.
hatorade
You could've at least linked to the page that deals with lazy quantifiers: http://www.regular-expressions.info/repeat.html
Alan Moore
+4  A: 

+? (lazy plus)

Repeats the previous item once or more. Lazy, so the engine first matches the previous item only once, before trying permutations with ever increasing matches of the preceding item.

/".+?"/ matches "def" (and "ghi") in abc "def" "ghi" jkl, while /".+"/ matches "def" "ghi".

You can find more info here

rahul
The quotes in the original example confused me at first -- I thought they were just setting the pattern off from the text, not part of the pattern. I think adding slashes clarifies the intent.
tvanfosson
Thanks @tvanfosson for providing the necessary edit.
rahul
as Svish apparently edited, /".+?"/ would match both "def" and "ghi" right? Separately. So it would return two expressions. Whereas /".+"/ catches the entire phrase "def" "ghi" counting def" "ghi as the .+ part?
hatorade
@hatorade, as far as I could see testing it in RegexBuddy, yes.
Svish
@Svish i tested in python and .+? seems to just return "def" and didn't mention catching "ghi" but that's probably because i need something else in my regex to indicate i want all the matches of the .+?
hatorade
+8  A: 

The "+?" is not a "+" quantifier followed by a "?" quantifier. Instead the "?" modifies the "+" to perform a "lazy" or "non greedy" match, meaning that the least number of characters that match is already sufficient.

So a "a+?" regex would match just a single "a" in "caaat".

Hans Kesting
Hans: the correct term is "quantifier" not "multiplier". HTH
Stephen C
+4  A: 

Besides what Hans Kesting already said, a lazy multiplier will do the exact oposite of the normal greedy multipliers: The possible match is kept as small as possible and the rest of the regular expression is tested.

So if you’re having the string aaba and test the regular expression a.*b on it, the internal processing steps would be as follows:

  1. a in a.*b matches aaba
  2. .* in a.*b matches aaba, and since .* is greedy
    1. .* then matches aaba
    2. .* then matches aaba
  3. b in a.*b fails as there is no letter left
    • backtracking goes one step back and .* will now only match bb in aaba
  4. b in a.*b still fails on aaba
    • backtracking goes one step back and .* now matches only b in aaba
  5. b in a.*b now matches b in aaba and we’re done.

So the full match is aaba.

If we do the same with a lazy multiplier (a.*?b), the processing will do the oposite, try to match the least possible characters as possible:

  1. a in a.*?b matches aaba
  2. .* in a.*?b matches nothing (* = zero or more repetitions), and since .* is declared as lazy (.*?), the rest of the regular expression is tested
  3. b in a.*?b fails on aaba
    • backtracking will try to increase the match of .*
  4. .* matches now aaba
  5. b in a.*?b matches aaba and we’re done.

So the full match if aaba.

Gumbo
So a.*b would catch BOTH aab and ab, correct? But a.*?b only catches aab? (using your aaba example for both scenarios). So .*? isn't necessarily trying to find the LEAST possibly characters, but the first instance of '.' being 1 letter, and if it matches, done. So once a.*?b catches aab, it wouldn't go on to try catching just ab.
hatorade
@hatorade: No, sorry. Both expressions match only **`aab`**`a`.
Gumbo
@Gumbo: I don't see why .*? wouldn't catch ab. a.*?b clearly matches ab. it makes sense that .* catches aab because it is greedy and tries to get the max fit. but isn't that the point of lazy plus?
hatorade
@hatorade: The match is performed on the whole string and not just a substring. So **`aab`**`a` is the only match for both regular expressions.
Gumbo
@Gumbo: well, apparently it does work as you say according to python. guess i just need to wrap my head around it.
hatorade
+1  A: 

There is documentation on how Perl handles these quantifiers perldoc perlre.

By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match. If you want it to match the minimum number of times possible, follow the quantifier with a "?". Note that the meanings don't change, just the "greediness":
    *?     Match 0 or more times, not greedily
    +?     Match 1 or more times, not greedily
    ??     Match 0 or 1 time, not greedily
    {n}?   Match exactly n times, not greedily
    {n,}?  Match at least n times, not greedily
    {n,m}? Match at least n but not more than m times, not greedily
By default, when a quantified subpattern does not allow the rest of the overall pattern to match, Perl will backtrack. However, this behaviour is sometimes undesirable. Thus Perl provides the "possessive" quantifier form as well.
    *+     Match 0 or more times and give nothing back
    ++     Match 1 or more times and give nothing back
    ?+     Match 0 or 1 time and give nothing back
    {n}+   Match exactly n times and give nothing back (redundant)
    {n,}+  Match at least n times and give nothing back
    {n,m}+ Match at least n but not more than m times and give nothing back
For instance,
   'aaaa' =~ /a++a/
will never match, as the a++ will gobble up all the a 's in the string and won't leave any for the remaining part of the pattern. This feature can be extremely useful to give perl hints about where it shouldn't backtrack. For instance, the typical "match a double-quoted string" problem can be most efficiently performed when written as:
   /"(?:[^"\\]++|\\.)*+"/
as we know that if the final quote does not match, backtracking will not help. See the independent subexpression (?>...) for more details; possessive quantifiers are just syntactic sugar for that construct. For instance the above example could also be written as follows:
   /"(?>(?:(?>[^"\\]+)|\\.)*)"/

link

Brad Gilbert