ansaurus

Question

HTML::TreeBuilder - Parse between parents

Answer 1

+2 A:

HTML::TreeBuilder version

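#!/usr/bin/perl

use strict; use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;

$tree->parse_file(\*DATA);
$tree->elementify;      # rebless the finished tree as a plain HTML::Element
$tree->objectify_text;  # wrap bare text nodes in ~text pseudo-elements

# Visit every <a> that carries an id attribute.
foreach my $atag ( $tree->look_down( '_tag', 'a' ) ) {
    if ($atag->attr('id')) {
        printf "Found %s\n", $atag->as_XML;
        process_p( $atag );
    }
}

# Print the text of each <p> sibling that follows $tag, up to the next <a>.
sub process_p {
    my ($tag) = @_;
    while ( defined( $tag ) and defined( my $next = $tag->right ) ) {
        last if lc $next->tag eq 'a';
        if ( lc $next->tag eq 'p') {
            # Turn ~text nodes back into strings so as_text can see them.
            $next->deobjectify_text;
            print $next->as_text, "\n";
        }
        $tag = $next;
    }
}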

__DATA__
<html>
<body>
   <a id="111" name="111"></a>
   <p>something</p>
   <p>something</p>
   <p>something</p>sometext
   <a href="xxx">something</a>
   <a id="222" name="222"></a>
   <p>something</p>
   <p>something</p>
   <p>something</p>
 </body>
 </html>

Output:

Found <a id="111" name="111"></a>

something
something
something
Found <a id="222" name="222"></a>

something
something
something

HTML::TokeParser::Simple version

#!/usr/bin/perl

use strict; use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);

# Skip ahead to each <a> start tag; only those with an id are of interest.
while ( my $tag = $parser->get_tag('a') ) {
    next unless $tag->get_attr('id');
    printf "Found %s\n", $tag->as_is;
    process_p($parser);
}

# Print the text of each <p> until the next <a> start tag is seen.
sub process_p {
    my ($parser) = @_;
    while ( my $next = $parser->get_token ) {
        if ( $next->is_start_tag('a') ) {
            # Push the <a> back so the caller's get_tag('a') finds it.
            $parser->unget_token($next);
            return;
        }
        elsif ( $next->is_start_tag('p') ) {
            print $parser->get_text('/p'), "\n";
        }
    }
    return;
}

Output:

Found <a id="111" name="111">
something
something
something
Found <a id="222" name="222">
something
something
something

Sinan Ünür 2010-10-06 19:14:55

Thanks Sinan, this works almost perfectly. I just noticed an issue in the HTML though: some of the `<p>something</p>` tags actually look like `<p>something</p>sometext`. When I try to run the above I get "Can't locate object method "tag" via package sometext".

Chris 2010-10-06 20:54:42

Is there some way to examine the "sometext" that appears and also continue on without an error? TIA!!

Chris 2010-10-06 20:55:18

That throws a monkey wrench in things. That's because the string is not wrapped in an `HTML::Element`. I'll post a solution in a few minutes.

Sinan Ünür 2010-10-06 21:09:42
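
To illustrate the point about strings not being wrapped in an `HTML::Element`: an element's children are a mix of element objects and plain Perl strings, so calling `->tag` on a string fails with the error quoted above. The sketch below is only an illustration, not the fix that was posted in the answer; `child_texts` is a hypothetical helper name, and the small tree is built by hand rather than parsed. It assumes the HTML::TreeBuilder distribution (which provides `HTML::Element`) is installed.

```perl
#!/usr/bin/perl

use strict; use warnings;
use HTML::Element;   # part of the HTML::TreeBuilder distribution

# Describe each child of $parent, whether it is an element or a bare string.
# (child_texts is a hypothetical helper, named here for illustration only.)
sub child_texts {
    my ($parent) = @_;
    my @texts;
    for my $child ( $parent->content_list ) {
        if ( ref $child ) {
            # A real HTML::Element: methods such as ->tag are safe here.
            push @texts, sprintf '%s: %s', $child->tag, $child->as_text;
        }
        else {
            # A plain Perl string; calling ->tag on it would die.
            push @texts, "text: $child";
        }
    }
    return @texts;
}

# Build <body><p>something</p>sometext</body> by hand.
my $body = HTML::Element->new('body');
my $p    = HTML::Element->new('p');
$p->push_content('something');
$body->push_content( $p, 'sometext' );

print "$_\n" for child_texts($body);
```

The same `ref` check applies to whatever `->right` returns, since a sibling can also be a plain string. Calling `objectify_text` on the tree first, as the TreeBuilder version above does, avoids the check by turning every string into a `~text` pseudo-element.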

That would be awesome Sinan, thanks!!

Chris 2010-10-06 21:26:43

@Chris Done! However, I am beginning to think `HTML::TokeParser::Simple` might be more appropriate for this task.

Sinan Ünür 2010-10-06 21:34:48

Hmm, I still get the same error. The only difference now is that I'm getting a warning about "$text->attr('text')" being an uninitialized value. Do you think I should be using TokeParser::Simple? I thought it also might have a problem with text outside of a tag, but I'll give it a shot.

Chris 2010-10-06 21:46:43

@Chris Fixed. The original script worked for the content you provided, but presumably, the `...` contain other elements than just text nodes.

Sinan Ünür 2010-10-06 21:54:08

Thanks Sinan, I'm going to give them both a try and see what works best. Thanks Again!!

Chris 2010-10-06 21:57:34

Sinan, I wonder if I could trouble you for one more bit of advice. Using the TreeBuilder method, is there some way to have "$next->as_text" be the whole parent P tag? Right now, if there is a nested P tag within the parent P tag, it stops at the end of the first nested P tag (if that makes sense).

Chris 2010-10-07 17:41:29

See http://www.w3.org/TR/html401/struct/text.html#h-9.3.1 "The P element represents a paragraph. It cannot contain block-level elements (including P itself)." Therefore, the HTML is invalid. However, in my tests, it seems like all the text in the nested `P` tags is printed. I am not inclined to investigate this further unless you post a separate question with applicable data.

Sinan Ünür 2010-10-08 01:08:38
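
One way to see what the parser does with a nested `P` is to ask each `p` element for its parent. The sketch below is only a probe under assumptions: `p_parent_tags` is a hypothetical helper name, the sample markup is made up, and it relies on HTML::TreeBuilder's default settings, under which (as far as I know) a new `<p>` implicitly closes an open one, so the "nested" paragraph ends up as a sibling rather than a child.

```perl
#!/usr/bin/perl

use strict; use warnings;
use HTML::TreeBuilder;

# Return the tag name of the parent of every <p> found in $html.
# (p_parent_tags is a hypothetical helper, named here for illustration only.)
sub p_parent_tags {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @parents = map { $_->parent->tag } $tree->look_down( _tag => 'p' );
    $tree->delete;   # free the tree once the tag names have been copied out
    return @parents;
}

# Made-up sample with a <p> written inside another <p>.
my @parents = p_parent_tags('<body><p>outer <p>inner</p> tail</p></body>');
print join( ', ', @parents ), "\n";
```

If none of the reported parents is `p`, the paragraphs are siblings in the tree, which would explain why `as_text` on the first one stops where the inner `<p>` began.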
