Hi experts,

I'm pretty new to Perl/HTML. Here is what I'm trying to do with WWW::Mechanize and HTML::TreeBuilder:

For each chemical element's page on Wikipedia, I need to extract all hyperlinks that point to other chemical elements' Wikipedia pages and print each unique pair in this format:

Atomic_Number1 (Chemical Element Title1) -> Atomic_Number2 (Chemical Element Title2)

The only problem is that every chemical element's page has a mini periodic table at the top right. That table links to every element, so it makes the result the same for every page. I'm having trouble extracting all links from the page EXCEPT those in that table.

[Note: for ease of debugging, I only look at $elem == 6 (Carbon); see line 42 of the script.]


Here is my code:

#!/usr/bin/perl -w

use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder;
# Create a single user agent object.
# autocheck => 1 makes failed requests die instead of failing silently.
my $mech = WWW::Mechanize->new( autocheck => 1 );

my $table_url = "http://en.wikipedia.org/wiki/Periodic_table";

$mech->agent('Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) '
           . 'AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 '
           . 'Safari/533.17.8');

$mech->get($table_url);

my $tree = HTML::TreeBuilder->new_from_content($mech->content);
my %elem_set;
my $atomic_num;

## build a hash mapping each atomic number to [title, href]
foreach my $td ($tree->look_down(_tag => 'td')) {

  # If there's no <a> in this <td>, then skip it:
  my $a = $td->look_down(_tag => 'a') or next;

  my $tdText = $td->as_text;
  my $aText  = $a->as_text;

  if($tdText =~ m/^(\d+)\S+$/){
    if($1 <= 114){  # only investigate up to the 114th element
      $atomic_num = $1;
      $elem_set{$atomic_num} = [$a->attr('title'), $a->attr('href')];
    }
  }
}

## On each element's page, look for links to other elements in the set
foreach my $elem (keys %elem_set) {
  if($elem == 6){
    # rebuild the element URL so that only English pages are fetched
    my $elem_url = "http://en.wikipedia.org" . $elem_set{$elem}[1];
    $mech->get($elem_url);

    #####################################################################
    ### need help here to exclude links from that mini periodic table ###
    #####################################################################

    my @target_links = $mech->links();
    for my $link ( @target_links ) {
      if( $link->url =~ m/^\/(wiki)\/.+$/ && $link->text =~ m/^\w+$/ ){
        printf("%s, %s\n", $link->text, $link->url);
      }
    }

  }
}
+2  A: 

Use WWW::Mechanize's update_html method to remove that table before finding the links. This method lets you replace the HTML held in $mech->content with an edited copy, and the links are then re-parsed from the new content.
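
A minimal sketch of that idea, under a blunt assumption: it drops every <table> on the page rather than only the mini periodic table. The helper name strip_tables and the sample HTML are made up for illustration, and the demo runs on an inline string so that it needs no network access.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return $html with every <table> element removed.
# Assumption: losing all tables is acceptable; this is not specific to
# the mini periodic table.
sub strip_tables {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Delete one table at a time so that nested tables are handled safely.
    while ( my $table = $tree->look_down( _tag => 'table' ) ) {
        $table->delete;
    }

    my $stripped = $tree->as_HTML;
    $tree->delete;    # free the parse tree
    return $stripped;
}

# Made-up sample page: one link inside a table, one inside a paragraph.
my $sample =
      '<html><body>'
    . '<table><tr><td><a href="/wiki/Helium">He</a></td></tr></table>'
    . '<p>Carbon bonds with <a href="/wiki/Oxygen">oxygen</a>.</p>'
    . '</body></html>';

my $stripped = strip_tables($sample);
print "$stripped\n";

# With the live $mech object from the question, the same step would be:
#   $mech->get($elem_url);
#   $mech->update_html( strip_tables( $mech->content ) );
#   my @target_links = $mech->links();   # parsed from the stripped HTML
```

Only the link inside the paragraph should survive in the printed HTML. On real pages this approach also throws away any element links that sit in other tables.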

AmbroseChapel
Thanks! It turns out, though, that deleting tables from the wiki pages is neither an accurate nor an efficient way to do what I intended, because the tables on each chemical element's page have different things in their tags, which makes it hard to write one table-deleting function that works for all pages. I ended up using HTML::TreeBuilder to look for links inside <p></p> tags, since the links I'm after are very likely to appear in paragraphs. That gave much more accurate results and ran pretty fast.
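
For readers who want to try the paragraph-only approach, here is a rough sketch of what it might look like. It is not the exact code used above: the helper name paragraph_links, the /wiki/ prefix filter, and the sample HTML are assumptions for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return unique [text, href] pairs for links that
# sit inside <p> elements and point at /wiki/ pages.
sub paragraph_links {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my ( @links, %seen );

    for my $p ( $tree->look_down( _tag => 'p' ) ) {
        # look_down accepts a regex as an attribute criterion.
        for my $anchor ( $p->look_down( _tag => 'a', href => qr{^/wiki/} ) ) {
            my $href = $anchor->attr('href');
            next if $seen{$href}++;    # keep each target only once
            push @links, [ $anchor->as_text, $href ];
        }
    }

    $tree->delete;    # free the parse tree
    return @links;
}

# Made-up sample page: a table link that should be ignored and a
# paragraph link that appears twice.
my $sample =
      '<html><body>'
    . '<table><tr><td><a href="/wiki/Helium">He</a></td></tr></table>'
    . '<p>Carbon bonds with <a href="/wiki/Oxygen">oxygen</a>.</p>'
    . '<p>See <a href="/wiki/Oxygen">oxygen</a> again.</p>'
    . '</body></html>';

my @found = paragraph_links($sample);
printf "%s, %s\n", @$_ for @found;
```

In the original script, $mech->content would be passed in place of $sample, and each href could then be checked against the values stored in %elem_set to keep only links to other elements.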
Z.Zen