ansaurus

Question

How can I find the contents of a div using Perl's HTML modules, if I know a tag inside of it?

Answer 1

+1 A:

get_attr('a1') should probably have read get_attr('id'), which would then print "a1".

I think getting the text content would look like:

while ( my $div = $parser->get_tag('div') ) {
  # Collect everything up to the closing </div>
  my $content = $parser->get_text('/div');
}

Or if you meant the text content of the link it would be:

while ( my $anchor = $parser->get_tag('a') ) {
    if ( my $href = $anchor->get_attr('href') ) {
        my $content = $parser->get_text('/a');
        # e.g. http://subscene.com/english/Sit-Down-Shut-Up-First-Season/subtitle-272112.aspx
        push @dnldLinks, $1 if $href =~ m!/subtitle-(\d{2,8})\.aspx!;
    }
}

dlamblin 2009-11-07 07:57:28

Thank you, that helped. The other part of the question is how to get the text that's between <div id='a1'>GETTHISCONTENT</div>. Can you help with that? Thanks!

Codygman 2009-11-07 08:01:10

Thanks for the help, and sorry for the confusion; I guess less is more on here. My overall goal is to get the a href link out of the <dt> tags in that specified div container.

Codygman 2009-11-07 08:11:50

Answer 2

+1 A:

You need to change get_attr("a1") to get_attr("id") here. get_attr(x) looks for an attribute named x, but you are giving it the attribute's value, not its name.
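
To make the name-versus-value distinction concrete, here is a minimal sketch. It assumes HTML::TokeParser::Simple (the module used elsewhere in this thread) and a made-up one-line sample based on the asker's `<div id='a1'>` snippet; the variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser::Simple;

# Hypothetical sample input, modelled on the snippet from the comments.
my $html = q{<div id="a1">GETTHISCONTENT</div>};

my $parser = HTML::TokeParser::Simple->new(string => $html);

my $div      = $parser->get_tag('div');
my $by_name  = $div->get_attr('id');    # looked up by attribute name
my $by_value = $div->get_attr('a1');    # no attribute is named 'a1'

# Text between the start tag just read and the closing </div>
my $content = $parser->get_trimmed_text('/div');

print "by name:  ", ( defined $by_name  ? $by_name  : 'undef' ), "\n";
print "by value: ", ( defined $by_value ? $by_value : 'undef' ), "\n";
print "content:  $content\n";
```

The lookup by name should return the id, while the lookup by value should come back undefined, which is why the original call printed nothing useful.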

Incidentally, the <dt> tag is not a <div>; it is the term tag of a <dl> (definition list).

Kinopiko 2009-11-07 07:58:20

Answer 3

+2 A:

Code using HTML::TreeBuilder:

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content($html);

for my $link ($tree->look_down(
  _tag => 'a',
  href => qr{/subtitle-\d{2,8}\.aspx})
) {
  # List context plus a capture group yields the numeric ID;
  # in scalar context the match would only return true or false
  my ($linkid) = $link->attr('href') =~ m!/subtitle-(\d{2,8})\.aspx!;
  # Scalar context gets the first, and the first is the nearest parent
  my $parent_div = $link->look_up(_tag => 'div');
  # Now the interesting bit of the link is in $linkid, the parent div ID
  # is $parent_div->id or $parent_div->attr('id'), and its text is e.g.
  # $parent_div->as_trimmed_text or you can do other stuff with its content.
}

hobbs 2009-11-07 08:13:40

I wish I could vote up! :) Thanks. I try not to bother you guys too much, but after an hour of trying to figure this out I was so frustrated!

Codygman 2009-11-07 08:21:14

The different parser subclasses are all good for different kinds of work. TokeParser is one of the simplest and fastest, but when you want to move up and down in the tag structure, TreeBuilder should be on your mind instead.

hobbs 2009-11-07 08:51:40

And I'm emphatically *not* begging for votes, but you now have 21 rep and can upvote me if you so choose, and you should also "accept" one of the answers to your question if you're satisfied.

hobbs 2009-11-07 08:53:57

Alrighty! Will do, I didn't notice that :)

Codygman 2009-11-11 21:20:35

Answer 4

+3 A:

You could use (yet another module!) HTML::TreeBuilder::XPath, which, as per its name, will let you use XPath on HTML::TreeBuilder objects.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $root = HTML::TreeBuilder::XPath->new_from_file( "my.html");

# print $root->as_HTML; # useful to see how HTML::TreeBuilder
# understands your HTML. For example it will wrap the implied
# dl element around dt, which you need to take into account
# when writing the XPath query below

my $id= "a1";
# you need the .//dt because of the extra dl
my @divs= $root->findnodes( qq{//div[.//dt[\@id="$id"]]});

print $divs[0]->as_HTML; # or as_text
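
If the goal is the link itself rather than the enclosing div, the same approach can be pointed at the href attribute. The following is only a sketch: the sample HTML and the div id listSubtitlesFilm are assumptions taken from elsewhere in this thread, and the `//dt` step uses the descendant axis so that an implied dl wrapper does not matter.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# Assumed sample input; replace with the real page content.
my $html = <<'HTML';
<div id="listSubtitlesFilm">
  <dt id="a1">
    <a href="/45/subtitles-67624.aspx">.45 (2006)</a>
  </dt>
</div>
HTML

my $root = HTML::TreeBuilder::XPath->new_from_content($html);

# href values of anchors sitting directly inside a dt, anywhere below the div
my @hrefs = $root->findvalues('//div[@id="listSubtitlesFilm"]//dt/a/@href');

print "$_\n" for @hrefs;

$root->delete;    # free the tree once finished with it
```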

mirod 2009-11-07 08:35:30

Thanks mirod, using XPath seems like it will really help my RAD :) The comments were really helpful too; knowing how it understands my HTML is very important.

Codygman 2009-11-11 21:03:31

Answer 5

+2 A:

To address your specific question, given the HTML:

<div id="listSubtitlesFilm">
  <dt id="a1">
    <a href="/45/subtitles-67624.aspx">
      .45 (2006)
    </a>
  </dt>
</div>

I am assuming you are interested in the anchor text (".45 (2006)" in this case), but only if the anchor occurs in a div with the id listSubtitlesFilm.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(handle => \*DATA);

my @dnldLinks;

while ( my $div = $parser->get_tag('div') ) {
    my $id = $div->get_attr('id');
    next unless defined($id) and $id eq 'listSubtitlesFilm';

    my $anchor = $parser->get_tag('a');
    my $href = $anchor->get_attr('href');
    next unless defined($href)
        and $href =~ m!/subtitles-(\d{2,8})\.aspx\z!;
    push @dnldLinks, [$parser->get_trimmed_text('/a'), $1];
}

use Data::Dumper;
print Dumper \@dnldLinks;


__DATA__
<div id="listSubtitlesFilm">
  <dt id="a1">
    <a href="/45/subtitles-67624.aspx">
      .45 (2006)
    </a>
  </dt>
</div>

Output:

$VAR1 = [
          [
            '.45 (2006)',
            '67624'
          ]
        ];

Sinan Ünür 2009-11-07 12:03:46

Thanks SO much for the detailed explanation Sinan! You're making me love Perl! :P

Codygman 2009-11-08 06:54:05
