ansaurus

Question

Please explain this Perl regular expression

Answer 1

+19 A:

I find the YAPE::Regex::Explain module very helpful -

C:\>perl -e "use YAPE::Regex::Explain;print YAPE::Regex::Explain->new(qr/['-]/)->explain;"
The regular expression:

(?-imsx:['-])

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ['-]                     any character of: ''', '-'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------



C:\>perl -e "use YAPE::Regex::Explain; print YAPE::Regex::Explain->new(qr/(\w+), ?(.)/)->explain;"
The regular expression:

(?-imsx:(\w+), ?(.))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  ,                        ','
----------------------------------------------------------------------
   ?                       ' ' (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    .                        any character except \n
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

C:\>

Ed Guiness 2008-12-19 14:57:32

whoooa hold on what the hay is all this? i appreciate the help, but this is just looking and reading weird... whats with all the ---------------------?

CheeseConQueso 2008-12-19 15:00:10

nevermind... it just came up as pre code... last time i viewed it, it was regular formatted

CheeseConQueso 2008-12-19 15:00:57

It's output from YAPE::Regex that will look better on your command line. The point is that there is a neat tool to help explain regex.

Ed Guiness 2008-12-19 15:01:47

yeah that does look helpful

CheeseConQueso 2008-12-19 15:04:49

Answer 2

+1 A:

1st line: the characters inside [] (' and -) are matched and replaced (s) by nothing, so they are removed. /g means global: every match in the string is replaced, not just the first one.

2nd line: \w means a word character, + means one or more times, ? means zero or one time, and "." means any single character (except a newline). So it means: find one or more word characters, followed by a comma, followed by an optional space, followed by any one character.
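To see what /g changes on the first line, here is a minimal sketch. The helper names strip_punct and strip_first_only are made up for illustration and are not from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every apostrophe and hyphen from a name.
sub strip_punct {
    my ($name) = @_;
    $name =~ s/['-]//g;    # /g: replace every match, not just the first
    return $name;
}

# Same substitution without /g: it stops after the first match.
sub strip_first_only {
    my ($name) = @_;
    $name =~ s/['-]//;
    return $name;
}

print strip_punct("O'Hanlon-Smyth"), "\n";         # OHanlonSmyth
print strip_first_only("O'Hanlon-Smyth"), "\n";    # OHanlon-Smyth
```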

Loki 2008-12-19 14:58:18

Answer 3

+8 A:

I keep one of these cheat sheets pinned on my cube wall for just such occasions. Google for regular expression cheat sheet to find others.

To add to what you already know:

  g -- search globally throughout the string
  + -- match at least one, but as many as possible
  ? -- match 0 or 1
  . -- match any character (except a newline)
 () -- group these together and capture what they match
  , -- a plain comma, no special meaning
 [] -- match any character inside the brackets
 \w -- match any word character

The magic is in the grouping -- the match expression uses the groups and puts them into variables $1 and $2. In this case $1 matches the word before the comma and $2 matches the first character following the whitespace after the comma.
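A small sketch of the grouping, assuming input in "Lastname, Firstname" form. split_name is a hypothetical helper and the sample names are invented; the second entry shows that the space after the comma can be missing.

```perl
use strict;
use warnings;

# Hypothetical helper: return the two captured groups, or an empty list
# when the pattern does not match.
sub split_name {
    my ($entry) = @_;
    if ($entry =~ /(\w+), ?(.)/) {
        return ($1, $2);    # $1 = word before the comma, $2 = next character
    }
    return;
}

for my $entry ("Smith, John", "Smith,John") {
    my ($last, $initial) = split_name($entry);
    print "$last / $initial\n";    # Smith / J in both cases
}
```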

tvanfosson 2008-12-19 14:59:31

yeah, i promptly removed that from my "knowns" when i found out haha - foolish

CheeseConQueso 2008-12-19 15:07:48

ahh thattts how the $1 and $2 exist.. thanks

CheeseConQueso 2008-12-19 15:09:17

just a small addition, the whitespace after the comma is optional (due to the ?)

Dashogun 2008-12-19 15:29:19

@Dashogun. Correct, but his example has the whitespace in it.

tvanfosson 2008-12-19 17:48:17

Answer 4

+1 A:

$lhs =~ s/foo/bar/g;

s/// is the substitution operator in Perl. The string on the left ($lhs) is searched for the first part (foo), and every match is replaced with the second part (bar); the /g makes it replace all matches rather than only the first. So "Lafooey" goes to "Labarey".

In your question, the aim is to remove all ' and - like in "O'Hanlon" and "Chalmonly-Witherington-Smyth".

Then it matches "Lastname, First character of firstname". The parentheses put the values of these matches into the variables $1 and $2.

And prints the lowercase of "F" + "Lastname", because these are the values in $2 and $1.

At the end of it, you have a viable username for a system based upon the person's real name from a telephone directory style listing.
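Putting those steps together, this is roughly what the code does. It is a sketch only: make_username is a name I made up, and the original prints to a filehandle rather than returning the value.

```perl
use strict;
use warnings;

# Hypothetical helper combining the steps described above.
sub make_username {
    my ($entry) = @_;
    $entry =~ s/['-]//g;                      # drop apostrophes and hyphens
    return unless $entry =~ /(\w+), ?(.)/;    # no match: return nothing
    return lc($2 . $1);                       # initial + last name, lowercased
}

print make_username("O'Hanlon, Patrick"), "\n";    # pohanlon
```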

JeeBee 2008-12-19 15:00:31

Answer 5

+1 A:

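iirc the =~ is the binding operator: it applies the match or substitution on its right to the variable on its left. It is not an assignment; a plain match just returns true if it matched.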

annakata 2008-12-19 15:00:38

Answer 6

+1 A:

The =~ matches the expression (string) on its left hand side against the regular expression on its right hand side. With a plain match it does not modify the string; only s/// does. As a side effect it sets the variables $1, $2, ... to the bracketed parts matched.

In your case the first bracket, "(\w+)", will match word characters repeated one or more times, and the second, "(.)", will match the first letter of the given name. The " ?" expression will match an optional space.
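A quick way to check this for yourself. The two helper names and the sample name are invented for the example; in list context the match also returns the captured groups directly.

```perl
use strict;
use warnings;

# Hypothetical helper: a plain match leaves the string as it was.
sub match_leaves_string {
    my ($s) = @_;
    my ($last, $initial) = $s =~ /(\w+), ?(.)/;    # list context: captures
    return ($s, $last, $initial);
}

# Hypothetical helper: s/// changes the string and returns a count.
sub subst_changes_string {
    my ($s) = @_;
    my $count = ($s =~ s/,//g);    # number of substitutions made
    return ($s, $count);
}

print join("|", match_leaves_string("Jones, Mary")), "\n";     # Jones, Mary|Jones|M
print join("|", subst_changes_string("Jones, Mary")), "\n";    # Jones Mary|1
```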

Diomidis Spinellis 2008-12-19 15:02:02

Answer 7

+3 A:

Download "The Regex Coach" and explore it. Consider purchasing "Mastering Regular Expressions" as it will walk you through this minefield. It is one of the best-typeset books I've ever seen and is deeply informative yet penetrable.

2008-12-22 01:17:15

Answer 8

+1 A:

Note that the given code fails miserably if the input isn't in the right format. Here's what I would do:

$rowfetch =~ s/[ '-]//g; # All chars inside the [ ] will be filtered out.
if ($rowfetch =~ m/(\w+),([a-z])/i) {
    # Only use $1 and $2 when the match succeeded.
    print $fh lc($2.$1);
}

The $1-$9 capture variables hold the groups from the last successful match, and they are not reset when a match fails. This means that if the regex fails to match, $1 and $2 keep their old values and you'll end up with something other than what you wanted.

I've also altered the regex slightly. The first line also removes spaces. Since it appears that you are creating usernames or email addresses, you don't want spaces. The second line is stricter to ensure that $2 is a letter, and not some other character. The 'i' at the end tells perl to make all letter matches case insensitive. With it, I don't have to write the second part as ([a-zA-Z]).
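Here is a small sketch of the stale-capture problem. Both helper names are hypothetical and the inputs are invented, with spaces assumed to be stripped already; the second helper is the guarded form.

```perl
use strict;
use warnings;

# Hypothetical demo: the second match fails, so $1 and $2 still hold
# the groups captured from $good.
sub stale_capture_demo {
    my ($good, $bad) = @_;
    $good =~ /(\w+),([a-z])/i;    # succeeds, sets $1 and $2
    $bad  =~ /(\w+),([a-z])/i;    # fails, leaves $1 and $2 untouched
    return lc($2 . $1);           # built from $good, not from $bad
}

# Guarded version: only read $1 and $2 after a successful match.
sub username_or_undef {
    my ($entry) = @_;
    return $entry =~ /(\w+),([a-z])/i ? lc($2 . $1) : undef;
}

print stale_capture_demo("Smith,John", "no match here"), "\n";    # jsmith
print defined username_or_undef("no match here") ? "matched" : "skipped", "\n";    # skipped
```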

2009-01-16 21:49:48

thanks... ill keep this in mind

CheeseConQueso 2009-01-20 18:06:49

Answer 9

+1 A:

There is a great web front end to YAPE::Regex::Explain.

Here is the explanation of s/['-]//g

and for m/(\w+), ?(.)/

drewk 2010-04-13 17:23:38
