How would the '.+?' regular expression work? Is the .+ part matching anything written, and the ? part saying it can either be there or not? So, for example, this regular expression would match:
'cat'
'' (ie, nothing written, just the empty string)
How would the '.+?' regular expression work? Is the .+ part matching anything written, and the ? part saying it can either be there or not? So, for example, this regular expression would match:
'cat'
'' (ie, nothing written, just the empty string)
+? (lazy plus)
Repeats the previous item once or more. Lazy, so the engine first matches the previous item only once, before trying permutations with ever increasing matches of the preceding item.
/".+?"/
matches "def" (and "ghi") in abc "def" "ghi" jkl, while /".+"/
matches "def" "ghi".
You can find more info here
The "+?" is not a "+" quantifier followed by a "?" quantifier. Instead the "?" modifies the "+" to perform a "lazy" or "non greedy" match, meaning that the least number of characters that match is already sufficient.
So a "a+?" regex would match just a single "a" in "caaat".
Besides what Hans Kesting already said, a lazy multiplier will do the exact oposite of the normal greedy multipliers: The possible match is kept as small as possible and the rest of the regular expression is tested.
So if you’re having the string aaba
and test the regular expression a.*b
on it, the internal processing steps would be as follows:
a
in a
.*b
matches a
aba
.*
in a
.*
b
matches a
a
ba
, and since .*
is greedy
.*
then matches a
ab
a
.*
then matches a
aba
b
in a.*
b
fails as there is no letter left
.*
will now only match bb
in a
ab
a
b
in a.*
b
still fails on aab
a
.*
now matches only b
in a
a
ba
b
in a.*
b
now matches b
in aa
b
a
and we’re done.So the full match is aab
a
.
If we do the same with a lazy multiplier (a.*?b
), the processing will do the oposite, try to match the least possible characters as possible:
a
in a
.*?b
matches a
aba
.*
in a
.*
?b
matches nothing (*
= zero or more repetitions), and since .*
is declared as lazy (.*?
), the rest of the regular expression is testedb
in a.*?
b
fails on a
a
ba
.*
.*
matches now a
a
ba
b
in a.*?
b
matches aa
b
a
and we’re done.So the full match if aab
a
.
There is documentation on how Perl handles these quantifiers perldoc perlre
.
By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match. If you want it to match the minimum number of times possible, follow the quantifier with a "?
". Note that the meanings don't change, just the "greediness":By default, when a quantified subpattern does not allow the rest of the overall pattern to match, Perl will backtrack. However, this behaviour is sometimes undesirable. Thus Perl provides the "possessive" quantifier form as well.
*? Match 0 or more times, not greedily +? Match 1 or more times, not greedily ?? Match 0 or 1 time, not greedily {n}? Match exactly n times, not greedily {n,}? Match at least n times, not greedily {n,m}? Match at least n but not more than m times, not greedilyFor instance,
*+ Match 0 or more times and give nothing back ++ Match 1 or more times and give nothing back ?+ Match 0 or 1 time and give nothing back {n}+ Match exactly n times and give nothing back (redundant) {n,}+ Match at least n times and give nothing back {n,m}+ Match at least n but not more than m times and give nothing backwill never match, as the
'aaaa' =~ /a++a/a++
will gobble up all thea
's in the string and won't leave any for the remaining part of the pattern. This feature can be extremely useful to give perl hints about where it shouldn't backtrack. For instance, the typical "match a double-quoted string" problem can be most efficiently performed when written as:as we know that if the final quote does not match, backtracking will not help. See the independent subexpression
/"(?:[^"\\]++|\\.)*+"/(?>...)
for more details; possessive quantifiers are just syntactic sugar for that construct. For instance the above example could also be written as follows:
/"(?>(?:(?>[^"\\]+)|\\.)*)"/