Why doesn't the first print statement output what I expect:

first = This is a test string, sec = This is a test string

Since both * and + are greedy, why does the inner * (i.e. the one inside the "((") in the first match not consume the entire string?

use strict;
use warnings;

my $string = "This is a test string";
$string =~ /((.*)*)/; 
print "first = $1, sec = $2\n";  #prints "first = This is a test string, sec ="

$string =~ /((.+)*)/;
print "first = $1, sec = $2\n";  #prints "first = This is a test string, sec = This is a test string"
+15  A: 

In the first regex .* is matched two times. The first time it matches the whole string. The second time it matches the empty string at the end, because .* matches the empty string when there is nothing else to match.

This does not happen with the other regex because .+ can't match the empty string.

Edit: As to what goes where: $2 will contain what is matched the last time .* / .+ are applied. $1 will contain what is matched by (.*)* / (.+)*, i.e. the whole string.
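To make "what goes where" visible, here is a minimal sketch that prints the start and end offsets of both groups for each of the two patterns, using Perl's built-in `@-` and `@+` arrays. The `%spans` hash and the loop over pattern strings are only illustrative and are not part of the original question.

```perl
use strict;
use warnings;

my $string = "This is a test string";
my %spans;    # illustrative: pattern => [ [start, end] of $1, [start, end] of $2 ]

for my $pattern ('((.*)*)', '((.+)*)') {
    if ($string =~ /$pattern/) {
        # $-[N] and $+[N] hold the start and end offsets of group N
        $spans{$pattern} = [ [ $-[1], $+[1] ], [ $-[2], $+[2] ] ];
        printf "%s: \$1 spans %d..%d, \$2 spans %d..%d\n",
            $pattern, $-[1], $+[1], $-[2], $+[2];
    }
}
# ((.*)*): $1 spans 0..21, $2 spans 21..21 (the empty string at the end)
# ((.+)*): $1 spans 0..21, $2 spans 0..21  (the whole string)
```

The offsets show that in the first pattern $2 is not unset; it is an empty match located at the very end of the string.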

sepp2k
Right. But since the brackets surround the string - I would expect the whole thing inside the brackets to be $2 (and not just the *)
Anna
So where does the outer () match end up? From your description I would've guessed $3, but it didn't go there.
Joe
@Anna Groups are counted by the opening parenthesis, so $1 is the whole string, where $2 is the inner set of parentheses.
Peter Di Cecco
The inner match is actually `$2`, the outer is `$1`. When the inner part matches a second time it "overwrites" the captured output from the first time it matched. If the inner `(.*)` matches several times, only the last match is preserved as `$2`.
sth
@sepp2k, since the regex match starts from the inner *, it has to consume the entire string. So why is $2 empty in the first case?
chappar
Because after matching the whole string, it can still match the empty string. You can always match the empty string. It's everywhere.
sepp2k
@sepp2k, thanks a lot
chappar
+13  A: 
Brad Gilbert
@Brad, how do I read this? Is there any documentation?
chappar
See "perldoc re", or perldoc.perl.org/re.html, which leads you to "perldoc perldebug" and http://perldoc.perl.org/perldebug.html#Debugging-regular-expressions.
Ether
Telemachus
This would be a better answer if it included some commentary on what we're supposed to notice about those two outputs.
Rob Kennedy
I highlighted the differences, hopefully it is easier to figure out.
Brad Gilbert
+3  A: 

The problem with the first regex is a combination of the fact that ()* only saves the last match and .* matches an empty string (i.e. nothing). So, given

"aaab" =~ /(.)*/;

$1 will be "b". If you combine that behavior with the fact that .* matches an empty string, you can see that there are two matches of the inner capture: "This is a test string" and "". Since the empty string came last it gets saved to $2. $1 is the whole capture, so it is equivalent to "This is a test string" . "". The second case works as you expect it to because .+ will not match an empty string.
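The "only the last match is saved" behavior can be checked with a short sketch. The second match, which uses /g in list context to collect every single-character capture, is an added illustration for contrast and not something the answer itself proposes; the variable names are arbitrary.

```perl
use strict;
use warnings;

my $text = "aaab";

# A quantified group keeps only the capture from its final iteration.
my ($last) = $text =~ /(.)*/;

# For contrast: matching the group repeatedly with /g returns every capture.
my @each = $text =~ /(.)/g;

print "last = $last\n";    # last = b
print "each = @each\n";    # each = a a a b
```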

Chas. Owens
+3  A: 

I don't have an answer, but I do have a different way of framing the issue, using simpler and perhaps more realistic regular expressions.

The first two examples behave exactly as I expect: .* consumes the entire string and the regular expression returns a list with only one element. But the third regular expression returns a list with 2 elements.

use strict;
use warnings;
use Data::Dumper;

$_ = "foo";
print Dumper( [ /^(.*)/g ] ); # ('foo')     As expected.
print Dumper( [ /.(.*)/g ] ); # ('oo')      As expected.
print Dumper( [ /(.*)/g  ] ); # ('foo', '') Why?

Many of the answers so far have emphasized that .* will match anything. While true, this response does not go to the heart of the matter, which is this: Why is the regular expression engine still hunting after .* has consumed the entire string? Under other circumstances (such as the first two examples), .* does not throw in an extra empty string for good measure.

Update after the useful comments from Chas. Owens. The first evaluation of any of the three examples results in .* matching the entire string. If we could intervene and call pos() at that moment, the engine would indeed be at the end of the string (at least as we perceive the string; see the comments from Chas. for more insight on this). However, the /g option tells Perl to try to match the entire regex again. That second attempt will fail for examples #1 and #2, and that failure will cause the engine to stop hunting. However, with regex #3, the engine will get another match: an empty string. Then the /g option tells the engine to try the entire pattern yet again. Now there really is nothing left to match -- neither regular characters nor the trailing empty string -- so the process stops.
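One way to "intervene and call pos()" is to run the third regex with /g in scalar context, where each loop iteration is one match attempt. This is a minimal sketch; the `@log` array is just an illustrative record of what happened.

```perl
use strict;
use warnings;

my $text = "foo";
my @log;    # illustrative: [ matched text, pos() after the match ]

while ($text =~ /(.*)/g) {
    push @log, [ $1, pos($text) ];
    printf "matched '%s', pos is now %d\n", $1, pos($text);
}
# matched 'foo', pos is now 3
# matched '', pos is now 3
```

The loop body runs twice. The third attempt fails because Perl does not allow a second zero-length match at the same position, and that failure ends the loop.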

FM
Imagine you are the regex engine. You have been instructed to match anything, so you start at "f". You see that you can add "o" and still match, then that you can add another "o" and still match. There are no more characters to match, so you complete the match. The g option causes you to see if there is another match after the first, so you look at the empty string that is left. The empty string matches, so you return it and then stop.
Chas. Owens
@Chas. I'm probably being dense, but why wouldn't the `/g` option have the same effect on the first two examples?
FM
Yes, it does, but because the first example is anchored at the start of the string the empty string at the end can't match (and is therefore not returned). In the second case, the match requires the existence of at least one character in the match, so it can't match an empty string.
Chas. Owens
Unanchored zero-or-more matches are generally confusing until you get the hang of them: `perl -le 'print for "ababa" =~ /a*/g'`. Here we get six matches: one for the first a, then another for the empty string between the a and the b; then the second a, then another empty string; finally the third a, and then the empty string between the a and the end of the string. This is why it is generally a bad idea to have unanchored zero-or-more matches.
Chas. Owens
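The six matches from the `ababa` one-liner can be listed together with the offset at which each one starts. This is only a sketch: the capturing parentheses and the `@found` array are added here for reporting and are not in the original one-liner.

```perl
use strict;
use warnings;

my $text = "ababa";
my @found;    # illustrative: one description per match

while ($text =~ /(a*)/g) {
    # $-[0] is the offset where the whole match started
    push @found, sprintf("'%s' at %d", $1, $-[0]);
}
print "$_\n" for @found;
# 'a' at 0
# '' at 1
# 'a' at 2
# '' at 3
# 'a' at 4
# '' at 5
```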
The reason we need the match-empty-string behavior is more apparent when we look at `"aa" =~ /a.*a/`. In order for this to match, the first a must match the first a, the second a must match the second a, and the `.*` must match the empty string between them. A string (as seen by the regex engine) is really just a list of characters separated by empty strings and the regex `/ab/` matches an a followed by an empty string, followed by a b.
Chas. Owens
@Chas. Thanks, that helps a lot.
FM
An empty string can still only be consumed once. So it matches up to the end of the string, matches the empty string at the end of the string, and then there is nothing left to match, so it stops.
Chas. Owens