Ever since I asked how to parse HTML with regex and got bashed a bit (rightfully so), I've been studying the HTML::TreeBuilder, HTML::Parser, HTML::TokeParser, and HTML::Element Perl modules.

I have HTML like this:

<div id="listSubtitlesFilm">
  <dt id="a1">
    <a href="/45/subtitles-67624.aspx">
      .45 (2006)
    </a>
  </dt>
</div>

I want to parse out the /45/subtitles-67624.aspx, but more importantly I want to know how to parse out the contents of the div.

I was given this example on a previous question:

while ( my $anchor = $parser->get_tag('a') ) {
    if ( my $href = $anchor->get_attr('href') ) {
        # http://subscene.com/english/Sit-Down-Shut-Up-First-Season/subtitles-272112.aspx
        push @dnldLinks, $1 if $href =~ m!/subtitle-(\d{2,8})\.aspx!;
    }
}

This worked perfectly for that, but when I tried to edit it a bit and use it on a div, it didn't work. Here is the code I tried:

while (my $anchor = $p->get_tag("dt")) {
  if($stuff = $anchor->get_attr('a1')) {
    print $stuff."\n";
  }
}
+1  A: 

get_attr('a1') should probably have read get_attr('id'), which would print "a1".

I think getting the text content would look like:

while ( my $anchor = $parser->get_tag('div') ) {
  my $content = $parser->get_text('/div');
}

Or if you meant the text content of the link it would be:

while ( my $anchor = $parser->get_tag('a') ) {
    if ( my $href = $anchor->get_attr('href') ) {
        my $content = $parser->get_text('/a');
        # http://subscene.com/english/Sit-Down-Shut-Up-First-Season/subtitle-272112.aspx
        push @dnldLinks, $1 if $href =~ m!/subtitle-(\d{2,8})\.aspx!;
    }
}
dlamblin
Thank you, that helped. The other part of the question is how to get the text of what's between <div id='a1'>GETTHISCONTENT</div>. Can you help with that? Thanks!
Codygman
Thanks for the help, sorry for the confusion, I guess less is more on here. My overall goal is to get the a href link out of the <dt> tags in that specified div container.
Codygman
+1  A: 

You need to change get_attr("a1") to get_attr("id") here. get_attr(x) looks for an attribute with the name x, but you are giving it the value of the attribute, not its name.

Incidentally, the <dt> tag is not a <div>; it is the term tag inside a <dl> (definition list).
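
As a rough illustration of the name-versus-value point, here is a minimal sketch. It assumes HTML::TokeParser::Simple is the parser in use (the question does not show how $p was created) and feeds it the markup from the question as a string; the variable names are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser::Simple;

# Sample markup copied from the question.
my $html = <<'HTML';
<div id="listSubtitlesFilm">
  <dt id="a1">
    <a href="/45/subtitles-67624.aspx">
      .45 (2006)
    </a>
  </dt>
</div>
HTML

# Assumption: the parser is built from a string rather than a file or URL.
my $parser = HTML::TokeParser::Simple->new(string => $html);

my @ids;
while ( my $dt = $parser->get_tag('dt') ) {
    # 'id' is the attribute's name; 'a1' is only its value.
    my $id = $dt->get_attr('id');
    push @ids, $id if defined $id;
}

print "$_\n" for @ids;
```

With the sample markup this should collect the single id value from the <dt> start tag; asking for get_attr('a1') instead would return undef, because no attribute has that name.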

Kinopiko
+2  A: 

Code using HTML::TreeBuilder:

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content($html);

for my $link ($tree->look_down(
  _tag => 'a',
  href => qr{/subtitles?-\d{2,8}\.aspx})
) {
  # List context captures the numeric ID; "subtitles?" covers both the
  # "subtitle-" and "subtitles-" spellings that appear in this thread.
  my ($linkid) = $link->attr('href') =~ m!/subtitles?-(\d{2,8})\.aspx!;
  # Scalar context gets the first, and the first is the nearest parent
  my $parent_div = $link->look_up(_tag => 'div');
  # Now the interesting bit of the link is in $linkid, the parent div ID
  # is $parent_div->id or $parent_div->attr('id'), and its text is e.g.
  # $parent_div->as_trimmed_text or you can do other stuff with its content.
}
hobbs
I wish I could vote up! :) Thanks, I try not to bother you guys too much, but after an hour of trying to figure this out I was so frustrated!
Codygman
The different parser subclasses are all good for different kinds of work. TokeParser is one of the simplest and fastest, but when you want to move up and down in the tag structure, TreeBuilder should be on your mind instead.
hobbs
And I'm emphatically *not* begging for votes, but you now have 21 rep and can upvote me if you so choose, and you should also "accept" one of the answers to your question if you're satisfied.
hobbs
Alrighty! Will do, I didn't notice that :)
Codygman
+3  A: 

You could use (yet another module!) HTML::TreeBuilder::XPath, which, as per its name, will let you use XPath on HTML::TreeBuilder objects.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $root = HTML::TreeBuilder::XPath->new_from_file( "my.html");

# print $root->as_HTML; # useful to see how HTML::TreeBuilder
# understands your HTML. For example it will wrap the implied
# dl element around dt, which you need to take into account
# when writing the XPath query below

my $id= "a1";
# you need the .//dt because of the extra dl
my @divs= $root->findnodes( qq{//div[.//dt[\@id="$id"]]});

print $divs[0]->as_HTML; # or as_text
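
Along the same lines, here is a hedged sketch of pulling the href values straight out of the anchors under a given div. It assumes the markup from the question is available as a string (rather than in my.html), and the variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# Sample markup copied from the question.
my $html = <<'HTML';
<div id="listSubtitlesFilm">
  <dt id="a1">
    <a href="/45/subtitles-67624.aspx">
      .45 (2006)
    </a>
  </dt>
</div>
HTML

my $root = HTML::TreeBuilder::XPath->new_from_content($html);

# "//a" under the div matches anchors at any depth, so an implied <dl>
# wrapped around the <dt> does not get in the way.
my @hrefs = grep { defined }
            map  { $_->attr('href') }
            $root->findnodes('//div[@id="listSubtitlesFilm"]//a');

print "$_\n" for @hrefs;

# Free the tree explicitly once it is no longer needed.
$root->delete;
```

The descendant axis keeps the query independent of how HTML::TreeBuilder decides to repair the dt-inside-div structure, which is the main reason to prefer it here over a fixed path.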
mirod
Thanks mirod, using XPath seems like it will really help my RAD :) The comments were really helpful too; knowing how it understands my HTML is very important.
Codygman
+2  A: 

To address your specific question, given the HTML:

<div id="listSubtitlesFilm">
  <dt id="a1">
    <a href="/45/subtitles-67624.aspx">
      .45 (2006)
    </a>
  </dt>
</div>

I am assuming you are interested in the anchor text, i.e. ".45 (2006)", in this case, but only if the anchor occurs in a div with id listSubtitlesFilm.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(handle => \*DATA);

my @dnldLinks;

while ( my $div = $parser->get_tag('div') ) {
    my $id = $div->get_attr('id');
    next unless defined($id) and $id eq 'listSubtitlesFilm';

    my $anchor = $parser->get_tag('a');
    my $href = $anchor->get_attr('href');
    next unless defined($href)
        and $href =~ m!/subtitles-(\d{2,8})\.aspx\z!;
    push @dnldLinks, [$parser->get_trimmed_text('/a'), $1];
}

use Data::Dumper;
print Dumper \@dnldLinks;


__DATA__
<div id="listSubtitlesFilm">
  <dt id="a1">
    <a href="/45/subtitles-67624.aspx">
      .45 (2006)
    </a>
  </dt>
</div>

Output:

$VAR1 = [
          [
            '.45 (2006)',
            '67624'
          ]
        ];
Sinan Ünür
Thanks SO much for the detailed explanation Sinan! You're making me love Perl! :P
Codygman