ansaurus

Question

Fetch <td> text while using WWW::Mechanize to fetch <a> within that <td> tag

Answer 1

A:

I found that HTML::TreeBuilder is a great way of parsing HTML documents and pulling info out of them. In this case, something like:

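use LWP::Simple;         # provides get()
use HTML::TreeBuilder;

my $page = get($URL) or die "Could not fetch $URL";
my $tree = HTML::TreeBuilder->new_from_content($page);

foreach my $cell ($tree->look_down(_tag => "td")) {
   # extract_links returns an arrayref of [$url, $element, $attr, $tag] arrayrefs;
   # passing 'a' restricts the result to <a> elements
   foreach my $entry (@{ $cell->extract_links('a') }) {
      my ($href, $element) = @$entry;
      print "href: $href; text: ", $element->as_text, "\n";
   }
}
$tree = $tree->delete;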


zigdon 2010-09-10 21:31:57

Ummm ... `extract_links` does not work the way you seem to think it does. The return value is an arrayref of arrayrefs, not an arrayref of elements.

cjm 2010-09-10 22:22:49

Answer 2

+4 A:

WWW::Mechanize is good at extracting links, but if you need to get other text, I usually combine it with HTML::TreeBuilder. Something like this:

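use HTML::TreeBuilder;

# $mech is a WWW::Mechanize object that has already fetched the page
my $tree = HTML::TreeBuilder->new_from_content($mech->content);

foreach my $td ($tree->look_down(_tag => 'td')) {

  # If there's no <a> in this <td>, then skip it
  # (named $link because $a is a special variable used by sort):
  my $link = $td->look_down(_tag => 'a') or next;

  my $tdText = $td->as_text;
  my $aText  = $link->as_text;

  # href and title may be absent, so default them to ''
  printf("td-text: %s\na-text: %s\nhref: %s\ntitle: %s\n",
         $tdText, $aText, $link->attr('href') // '', $link->attr('title') // '');
}

$tree->delete;  # free the parse tree when done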

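The only problem with this code is that $td->as_text returns all of the text in the <td>, including the link text and everything after it, which is more than you want. How you trim it is up to you. If $aText is sufficiently unique within the cell, you can delete it and everything after it with something like: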

$tdText =~ s/\Q$aText\E.*//s;
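To make the effect of that substitution concrete, here is a small sketch using core Perl only. The helper name text_before_link and the sample strings are invented for illustration and are not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: return the part of $tdText that precedes $aText.
sub text_before_link {
    my ($tdText, $aText) = @_;
    # \Q...\E quotes any regex metacharacters in the link text;
    # /s lets .* run across newlines to the end of the string.
    $tdText =~ s/\Q$aText\E.*//s;
    return $tdText;
}

# Sample strings are made up for this illustration.
my $cell = "Price: Widget (v1.0) in stock\nShips Monday";
print text_before_link($cell, 'Widget (v1.0)'), "\n";   # prints "Price: "
```

The substitution cuts at the first place the link text occurs, so if the same words also appear earlier in the cell the result is truncated too soon. That is why the link text needs to be sufficiently unique.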

In the worst case, you'd have to write your own function to extract the text elements you want, stopping at the <br> (or however you determine the stopping point).
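One possible shape for such a function is sketched below. It assumes the stopping point is the first <br> that is a direct child of the <td>. The name text_before_br and the sample markup are hypothetical; only content_list, tag, as_text and look_down come from HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: join the text of a <td>'s children up to,
# but not including, its first direct-child <br>.
sub text_before_br {
    my ($td) = @_;
    my @parts;
    foreach my $node ($td->content_list) {
        if (ref $node) {
            # Child element: stop at <br>, otherwise take its text
            last if $node->tag eq 'br';
            push @parts, $node->as_text;
        }
        else {
            # Plain text nodes are stored as ordinary strings
            push @parts, $node;
        }
    }
    my $text = join '', @parts;
    $text =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    return $text;
}

# Sample markup is made up for this illustration.
my $html = '<table><tr><td>Price: <a href="/item/1">Widget</a> in stock'
         . '<br>Ships Monday</td></tr></table>';
my $tree = HTML::TreeBuilder->new_from_content($html);
my ($td) = $tree->look_down(_tag => 'td');
print text_before_br($td), "\n";
$tree->delete;
```

A <br> nested inside another element (for example inside the <a>) would not stop this loop. Handling that case would need a recursive walk instead.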

cjm 2010-09-10 22:36:12

In addition to that, I can recommend http://search.cpan.org/dist/HTML-TreeBuilder-LibXML/ which is an extension of HTML-TreeBuilder that also gives the programmer all the power of XPath and LibXML. I've been using it a lot for testing HTML pages recently.
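As a rough sketch of what the XPath route could look like for this question, assuming HTML::TreeBuilder::LibXML is installed: the function name linked_cells and the sample markup are invented here, and the calls used (new, parse, eof, findnodes, as_text, attr) are the ones listed in the module's documentation.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::LibXML;

# Hypothetical helper: for every <td> that has an <a> child, return
# the cell text, the link text and the href.
sub linked_cells {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder::LibXML->new;
    $tree->parse($html);
    $tree->eof;

    my @rows;
    # XPath: select only the <td> elements with a direct <a> child
    foreach my $td ($tree->findnodes('//td[a]')) {
        my ($link) = $td->findnodes('./a');
        push @rows, {
            td_text => $td->as_text,
            a_text  => $link->as_text,
            href    => $link->attr('href') // '',
        };
    }
    return @rows;
}

# Sample markup is made up for this illustration.
my $html = '<table><tr><td>Price: <a href="/item/1">Widget</a></td>'
         . '<td>No link here</td></tr></table>';
foreach my $row (linked_cells($html)) {
    print "td-text: $row->{td_text}\na-text: $row->{a_text}\nhref: $row->{href}\n";
}
```

The XPath predicate does the work of the "skip cells without a link" check, so the loop body only deals with cells that matter.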

Shlomi Fish 2010-09-11 13:12:32
