If I run

"Year 2010" =~ /([0-4]*)/;
print $1;

I get an empty string. But

"Year 2010" =~ /([0-4]+)/;
print $1;

outputs "2010". Why?

+17  A: 

You get an empty match right at the start of the string "Year 2010" for the first form because the * will immediately match 0 digits. The + form will have to wait until it sees at least one digit before it matches.

Presumably if you can go through all the matches of the first form, you'll eventually find 2010... but probably only after it finds another empty match before the 'e', then before the 'a' etc.
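A minimal sketch of walking through every match with the /g modifier, to see where the empty matches fall. The variable names are illustrative and not taken from the question.

```perl
use strict;
use warnings;

my $text = "Year 2010";
my @found;    # each entry is [start offset, captured text]

# In a while loop, /g resumes the search where the previous match ended.
while ($text =~ /([0-4]*)/g) {
    push @found, [ $-[0], $1 ];    # $-[0] is the start offset of the match
}

printf "offset %d: '%s'\n", @$_ for @found;
```

The first entry is an empty match at offset 0, and 2010 only shows up once the engine reaches offset 5.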

Jon Skeet
Great, thank you!
alexanderkuk
The set generated by the Kleene star also contains the empty string, so yes, it will match the empty string before Y, e, a, r, and the space, and then it will find 2010.
raceCh-
+5  A: 

The first matches the zero-length string at the beginning (before Y) and returns it. The second searches for one-or-more digits and waits until it finds 2010.

eumiro
+6  A: 

The first regular expression successfully matches zero digits at the start of the string, which results in capturing the empty string.

The second regular expression fails to match at the start of the string, but it does match when it reaches 2010.
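One way to check this is to print where each match starts, using Perl's @- array ($-[0] holds the start offset of the last successful match). This is only a sketch; the hash name %start and its keys are made up for the example.

```perl
use strict;
use warnings;

my $text = "Year 2010";
my %start;    # pattern name => [offset where the match began, captured text]

if ($text =~ /([0-4]*)/) { $start{star} = [ $-[0], $1 ] }
if ($text =~ /([0-4]+)/) { $start{plus} = [ $-[0], $1 ] }

for my $name (sort keys %start) {
    printf "%-4s offset %d, captured '%s'\n", $name, @{ $start{$name} };
}
```

The star form succeeds at offset 0 with nothing captured, while the plus form only succeeds at offset 5.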

Mark Byers
+5  A: 

You can also use YAPE::Regex::Explain to get an explanation of a regular expression, like this:

use YAPE::Regex::Explain;

print YAPE::Regex::Explain->new('([0-4]*)')->explain();
print YAPE::Regex::Explain->new('([0-4]+)')->explain();

output:

The regular expression:
(?-imsx:([0-4]*))
matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [0-4]*                   any character of: '0' to '4' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

The regular expression:
(?-imsx:([0-4]+))
matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [0-4]+                   any character of: '0' to '4' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
Nikhil Jain
+1  A: 

The star tries to match zero or more symbols from the given set (in theory, the set {x,y}* consists of the empty string and all possible finite sequences made of x and y). It will therefore match exactly zero characters (the empty string) at the beginning of the string, zero characters after the first character, zero characters after the second character, and so on. Only when it reaches the 2 does it match the whole of 2010.

The plus matches one or more characters from the given set ({x,y}+ consists of all possible finite sequences made of x and y but, unlike {x,y}*, excludes the empty string). So the first matching character it meets is 2; then 0 is checked, then 1, then another 0, and then the string ends, so the captured group is '2010'.
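The relationship between the two sets can be sketched in Perl by writing the star as "plus, or else the empty string" with an alternation. The variable names below are illustrative only.

```perl
use strict;
use warnings;

my $text = "Year 2010";

# In list context a match returns its captures.
my ($star)          = $text =~ /([0-4]*)/;     # zero or more
my ($plus_or_empty) = $text =~ /([0-4]+|)/;    # one or more, or nothing
my ($plus)          = $text =~ /([0-4]+)/;     # one or more

printf "star: '%s', plus-or-empty: '%s', plus: '%s'\n",
    $star, $plus_or_empty, $plus;
```

The first two captures come back empty for this input, because the empty alternative succeeds before Y; only the bare plus form captures 2010.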

This is standard behavior for regular expressions, defined in formal language theory. I strongly suggest learning a bit of the theory behind regular expressions; it can't hurt, and it can help :)

raceCh-
A: 

To make your first RE capture the digits, anchor it to the end of the string with '$':

"Year 2010" =~ /([0-4]*)$/;
print $1;
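Note that the anchor only helps when the digits run up to the end of the string. A small sketch of that limit follows; the second input string is a made-up example, not from the question.

```perl
use strict;
use warnings;

my %captured;    # input string => what ([0-4]*)$ captured
for my $text ("Year 2010", "Year 2010 AD") {
    if ($text =~ /([0-4]*)$/) {
        $captured{$text} = $1;
    }
}

printf "%-14s => '%s'\n", $_, $captured{$_} for sort keys %captured;
```

With trailing text after the digits, the pattern still matches, but only as an empty match at the very end of the string.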
+1  A: 

We have this as a trick question in Learning Perl. Any regex that can match zero characters will do exactly that at the beginning of the string if it can't match anything longer there.

The Perl regex engine returns the leftmost match, and leftmost takes priority over longest: greedy quantifiers grab as much as they can, but only from the first starting position where the pattern can succeed. Not all regex engines work like that, though. If you want all of the technical details, read Mastering Regular Expressions, which explains how regex engines work and find matches.
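A short sketch of that priority in Perl. The "foobar" input is a made-up illustration, included on the assumption that alternation is a helpful second case: it shows that Perl takes the first alternative that succeeds at a given start, not necessarily the longest one.

```perl
use strict;
use warnings;

# Leftmost beats longest: the empty match at offset 0 wins over "2010" at offset 5.
my ($first) = "Year 2010" =~ /([0-4]*)/;

# At a given start, alternation takes the first alternative that succeeds,
# even when a later alternative would match more text.
my ($alt) = "foobar" =~ /(foo|foobar)/;

print "first: '$first', alt: '$alt'\n";
```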

brian d foy