I'm teaching myself Perl and I learn best by example. As such, I'm studying a simple Perl script that scrapes a specific blog, and I'm confused by a couple of its regex statements. The script looks for the following chunks of HTML:

 <dt><a name="2004-10-25"><strong>October 25th</strong></a></dt>
 <dd>
   <p>
     [Content]
   </p>
 </dd>
 ... and so on.

and here's the example script I'm studying:

#!/usr/bin/perl -w

use strict;
use XML::RSS;
use LWP::Simple;
use HTML::Entities;

my $rss = new XML::RSS (version => '1.0');
my $url = "http://www.linux.org.uk/~telsa/Diary/diary.html";
my $page = get($url);

$rss->channel(title       => "The more accurate diary. Really.",
              link        => $url,
              description => "Telsa's diary of life with a hacker:"
                             . " the current ramblings");

foreach (split ('<dt>', $page))
{
    if (/<a\sname="
         ([^"]*)     # Anchor name
         ">
         <strong>
         ([^>]*)     # Post title
         <\/strong><\/a><\/dt>\s*<dd>
         (.*)        # Body of post
         <\/dd>/six)
    {
        $rss->add_item(title       => $2,
                       link        => "$url#$1",
                       description => encode_entities($3));
    }
}

If you have a moment to better help me understand, my questions are:

  1. how does the following line work:

    ([^"]*) # Anchor name

  2. how does the following line work:

    ([^>]*) # Post title

  3. what does the "six" mean in the following line:

    <\/dd>/six)

Thanks so much in advance for all your help! I'm also researching the answers to my own questions at the moment, but was hoping someone could give me a boost!

+7  A: 

how does the following line work...

([^"]*) # Anchor name

Zero or more characters which aren't ", captured as $1, $2, and so on, depending on how many opening brackets ( come before it in the pattern. This is the first group, so it is captured as $1.
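As a rough illustration, here is a minimal sketch that applies just that piece of the pattern to a sample line from the question. The variable names ($html, $empty, $anchor, $empty_anchor) and the second sample string are made up for the example. It also shows that "zero or more" really does allow an empty match.

```perl
use strict;
use warnings;

# Sample strings for illustration; the second has an empty name attribute.
my $html  = '<a name="2004-10-25"><strong>October 25th</strong></a>';
my $empty = '<a name=""><strong>Untitled</strong></a>';

# [^"]* consumes characters up to, but not including, the next double quote.
# In list context the match returns the captured groups.
my ($anchor)       = $html  =~ /<a\sname="([^"]*)"/;
my ($empty_anchor) = $empty =~ /<a\sname="([^"]*)"/;

print "anchor: '$anchor'\n";        # anchor: '2004-10-25'
print "empty:  '$empty_anchor'\n";  # empty:  ''
```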

how does the following line work...

([^>]*) # Post title

Zero or more characters which aren't >, captured in the same way. This is the second group, so it is captured as $2.
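To see the group numbering in action, here is a small sketch using a shortened version of the script's pattern on one sample line. The names $html, $anchor and $title are only for illustration. The first ( fills $1 and the second fills $2, which is why the script uses $1 for the link and $2 for the title.

```perl
use strict;
use warnings;

# Sample line taken from the HTML in the question.
my $html = '<a name="2004-10-25"><strong>October 25th</strong></a>';

my ($anchor, $title);
if ($html =~ /<a\sname="([^"]*)"><strong>([^>]*)<\/strong>/) {
    # Groups are numbered by the position of their opening parenthesis.
    ($anchor, $title) = ($1, $2);
}

print "anchor: $anchor\n";  # anchor: 2004-10-25
print "title:  $title\n";   # title:  October 25th
```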

what does the "six" mean in the following line...

<\/dd>/six)

  • s = match as single line (this just means that "." matches any character, including \n, which it would not do otherwise)
  • i = match case-insensitively
  • x = ignore whitespace in the regex, unless it is escaped with a backslash or inside a [...] character class.

x also makes it possible to put comments into the regex itself, so the things like # Post title there are just comments.
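Here is a minimal sketch of what the three flags change, assuming a made-up two-line string in $text (the variable names are just for the example). The same pattern fails without s and succeeds with it.

```perl
use strict;
use warnings;

# Upper-case tags and an embedded newline, to exercise i and s.
my $text = "<DD>line one\nline two</DD>";

# Only i: "." cannot cross the newline, so the match fails.
my $without_s = ($text =~ /<dd>(.*)<\/dd>/i) ? 1 : 0;

# s lets "." match "\n", i lets <dd> match <DD>,
# and x allows the layout and the comment inside the pattern.
my ($body) = $text =~ /
    <dd>
    (.*)      # body, possibly spanning several lines
    <\/dd>
/six;

print "matched without s: $without_s\n";  # matched without s: 0
print "body with six:\n$body\n";
```

One thing to watch with x: a comment inside the pattern must not contain the regex delimiter (here /), or Perl will treat it as the end of the pattern.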

See perldoc perlre for more detailed information. The link is for Perl 5.10. If you are running a different version, look at the perlre document for your version of Perl instead.

Kinopiko
thank you very much!
BeachRunnerJoe
"match as single line" is not very informative. /s means . matches any character, including newline, instead of the default any character except newline.
ysth
@ysth: I've altered it as you suggest.
Kinopiko
+1  A: 
  1. The code is an extended regex (the x modifier), which lets you put whitespace and comments in your regexes. See perldoc perlre and perlretut. Otherwise it behaves like a normal regex.

  2. Same.

  3. The characters are regex modifiers.
daotoad
+2  A: 
  1. [^"]* means "any string of zero or more characters that doesn't contain a quotation mark". It is surrounded by quotes, forming a quoted string, the kind that follows <a name=
  2. [^>]* is similar to the above: it means any string that doesn't contain >. Note that you probably mean [^<] here, to match up to the opening < of the next tag without including it.
  3. that's a collection of php specific regexp flags. I know i means case insensitive, not sure about the rest.
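On point 2, here is a small sketch of the difference between the two character classes, using a made-up sample string ($html, $loose and $tight are names invented for the example). When nothing follows the group, [^>]* runs past the < of the closing tag while [^<]* stops in front of it. In the script from the question the group is followed by <\/strong>, which forces the regex to backtrack, so both versions end up capturing the same title there.

```perl
use strict;
use warnings;

my $html = '<strong>October 25th</strong>';

# [^>]* keeps going through "<" and only stops before the next ">".
my ($loose) = $html =~ /<strong>([^>]*)/;

# [^<]* stops before the "<" that opens the closing tag.
my ($tight) = $html =~ /<strong>([^<]*)/;

print "loose: $loose\n";  # loose: October 25th</strong
print "tight: $tight\n";  # tight: October 25th
```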
Blindy
re: #3 - No, it's not PHP-specific. They're flags used by PCRE regex libraries, which are available for many different languages. See the documentation at http://www.pcre.org/pcre.txt for full details of PCRE. (PCRE = Perl-Compatible Regular Expressions)
Dave Sherohman
regardless, I still don't know what they all mean :)
Blindy