I'm currently working on a Perl script to gather data from the QuakeLive website. Everything was going fine until I couldn't get a set of data.

I was using regexes for that and they work for everything apart from the favourite arena, weapon and game type. I just need to get the names of those three elements in a $1 for further processing.

I tried matching up to the favorites image, but without success. If it helps, I'm already using WWW::Mechanize in the script.

I suspect the problem is related to the paragraphs that hold those elements having a class name, whereas the earlier part of the page, where my regexes do match, has none.

You can find an example profile HERE.

Note that for the previous part of the page, it worked using code like:

$content =~ /<b>Wins:<\/b> (.*?)<br \/>/;
$wins = $1;
print "Wins: $wins\n";
+5  A: 

Using regular expressions for this particular task is less than ideal. There are just too many things that might change, and you're not taking advantage of the inherent structure of HTML pages. Have you considered using something like HTML::TreeBuilder instead? It will allow you to say "get me the value of the 3rd table cell in the table named weapons", etc.

zigdon
+7  A: 

The immediate problem is that you have:

<p class="prf_faves">
<img src="http://cdn.quakelive.com/web/2010092807/images/profile/none_v2010092807.0.gif" 
     width="17" height="17" alt="" class="fl fivepxhr" />
                <b>Arena:</b> Campgrounds
                <div class="cl"></div>
            </p>

That is, there is no <br /> following the value for favorites such as Arena. Now, the correct way to do this would involve using a proper HTML parser. The fragile solution is to adapt your pattern (untested):

my ($favarena) = $content =~ m{<b>Arena:</b> ([^<]+)};

That should put everything up to the < of the next <div> in $favarena. Now, if all arenas are single words with no spaces in them,

my ($favarena) = $content =~ m{<b>Arena:</b> (\S+)};

would save you the trouble of having to trim whitespace afterwards.
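
For a quick check of the first pattern, here is a minimal sketch that runs it against the fragment shown above. The heredoc is only a stand-in: in your script `$content` would hold the fetched page. It also shows the trimming step that `[^<]+` makes necessary.

```perl
#!/usr/bin/perl

use strict; use warnings;

# Sample fragment copied from the profile page; in the real script
# $content would come from the fetched page instead.
my $content = <<'HTML';
<p class="prf_faves">
<img src="http://cdn.quakelive.com/web/2010092807/images/profile/none_v2010092807.0.gif"
     width="17" height="17" alt="" class="fl fivepxhr" />
                <b>Arena:</b> Campgrounds
                <div class="cl"></div>
            </p>
HTML

# Capture everything between "<b>Arena:</b> " and the next "<".
my ($favarena) = $content =~ m{<b>Arena:</b> ([^<]+)};

# [^<]+ also swallows the newline and indentation before <div>.
$favarena =~ s/\s+\z// if defined $favarena;

print "Arena: $favarena\n" if defined $favarena;
```

Because the match is done in list context, `$favarena` is simply undefined when the pattern fails, so there is no risk of picking up a stale `$1` from an earlier match.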

Note that it is easy for such regex-based solutions to be fooled by simple things like commented-out snippets in the source. E.g., if the source were to be changed to:

<p class="prf_faves">
<img src="http://cdn.quakelive.com/web/2010092807/images/profile/none_v2010092807.0.gif" 
     width="17" height="17" alt="" class="fl fivepxhr" />
<!-- <b>Arena:</b> here -->
                <b>Arena:</b> Campgrounds
                <div class="cl"></div>
            </p>

your script would be in trouble, whereas a solution using an HTML parser would not.
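
To make the failure concrete, here is a small sketch using a hypothetical variant of the fragment in which the commented-out label matches the pattern character for character (the `Old Arena` text is made up for illustration). The regex happily captures from inside the comment:

```perl
#!/usr/bin/perl

use strict; use warnings;

# Hypothetical variant of the fragment: a commented-out label that is
# character-for-character identical to the real one precedes it.
my $content = <<'HTML';
<p class="prf_faves">
<!-- <b>Arena:</b> Old Arena -->
                <b>Arena:</b> Campgrounds
                <div class="cl"></div>
            </p>
HTML

# Same pattern as before; the first match is the one inside the comment.
my ($naive) = $content =~ m{<b>Arena:</b> ([^<]+)};
$naive =~ s/\s+\z//;

print "Regex captured: '$naive'\n";    # not 'Campgrounds'
```

A tokenizing parser hands the comment back as a single comment token, so the `<b>` inside it is never seen as a tag.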

An example using HTML::TokeParser::Simple:

#!/usr/bin/perl

use strict; use warnings;

use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new( 'martianbuddy.html' );

while ( my $tag = $p->get_tag('p') ) {
    next unless $tag->is_start_tag;
    next unless defined (my $class = $tag->get_attr('class'));
    next unless grep { /^prf_faves\z/ } split ' ', $class;

    my $fav = $p->get_tag('b');
    my $type = $p->get_text('/b');
    my $value = $p->get_text('/p');
    $value =~ s/\s+\z//;

    print "$type $value\n";
}

Output:

Arena:  Campgrounds
Game Type:  Clan Arena
Weapon:  Rocket Launcher

And, here is an example using HTML::TreeBuilder:

#!/usr/bin/perl

use strict; use warnings;

use HTML::TreeBuilder;
use YAML;

my $tree = HTML::TreeBuilder->new;
$tree->parse_file('martianbuddy.html');

my @p = $tree->look_down(_tag => 'p', sub {
        return unless defined (my $class = $_[0]->attr('class'));
        return unless grep { /^prf_faves\z/ } split ' ', $class;
        return 1;
    }
);

for my $p ( @p ) {
    my $text = $p->as_text;
    $text =~ s/^\s+//;
    my ($type, $value) = split ': ', $text;
    print "$type: $value\n";
}

Output:

Arena: Campgrounds 
Game Type: Clan Arena 
Weapon: Rocket Launcher

Given that the document is an HTML fragment rather than a full document, you will have more success with modules based on HTML::Parser rather than those that expect to operate on well-formed XML documents.

Sinan Ünür
This works! Since you've been so helpful, would you mind suggesting a good way to parse HTML for this script? Which module would you suggest?
Gurzo
Ok, read your edit and got it. Now I'm left with only one problem: getting an HTML file (to try your example I manually downloaded the page). As you may have noticed, the website does not give a direct link (/summary/martianbuddy) and adding .html to the end does not work. Any ideas?
Gurzo
@TheGiantPanda: Obviously, I saved a local copy of the file. Now, you did mention you are using `WWW::Mechanize` to crawl the site. That's how you get the content.
Sinan Ünür
I just discovered that Mechanize has a simple way to strip HTML from a page: $mech->content( format => 'text' ). Seems like a viable thing.
Gurzo
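
Following up on the comments about WWW::Mechanize: a rough sketch of how the fetched page could be fed straight into the parser, with no local copy needed. The helper name `extract_faves` is made up for this example, and the profile URL is taken from the command line because the thread does not give one. Note that `$mech->content( format => 'text' )` returns the page with the markup stripped, so the class-based lookup below needs the plain `$mech->content` instead.

```perl
#!/usr/bin/perl

use strict; use warnings;

use HTML::TokeParser::Simple;

# Hypothetical helper: collect "label => value" pairs from every
# <p> whose class list contains prf_faves.
sub extract_faves {
    my ($html) = @_;
    my $p = HTML::TokeParser::Simple->new( string => $html );
    my %fav;

    while ( my $tag = $p->get_tag('p') ) {
        next unless defined (my $class = $tag->get_attr('class'));
        next unless grep { $_ eq 'prf_faves' } split ' ', $class;

        $p->get_tag('b') or last;             # skip ahead to the label
        my $type  = $p->get_text('/b');
        my $value = $p->get_text('/p');
        s/^\s+|\s+\z//g for $type, $value;    # trim both ends
        $type =~ s/:\z//;                     # drop the trailing colon
        $fav{$type} = $value;
    }
    return \%fav;
}

# Only fetch when a URL is supplied, e.g.: perl faves.pl <profile-url>
if ( my $url = shift @ARGV ) {
    require WWW::Mechanize;
    my $mech = WWW::Mechanize->new;
    $mech->get($url);
    my $fav = extract_faves( $mech->content );
    print "$_: $fav->{$_}\n" for sort keys %$fav;
}
```

Keeping the parsing in a sub that takes a string means it can be exercised against a saved copy of the page as well as against live content.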