ansaurus

Question

Perl web scraper, extract content from DIV that only has "style" tag?

Answer 1

+6 A:

If you want a DOM parser (easier to use thanks to tree browsing, but slightly slower), try HTML::TreeBuilder.

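From the HTML::Element man page (the module is included with HTML::TreeBuilder):

Note also that look_down considers "" (empty-string) and undef to be
different things, in attribute values. So this:
  $h->look_down("alt", "")
will find elements with an "alt" attribute whose value is "", whereas
  $h->look_down("alt", undef)
finds elements that do not have an "alt" attribute at all.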

Which leads us to your answer:

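use HTML::TreeBuilder;

# Check the HTML::TreeBuilder POD, there are a few ways to construct a tree:
# new_from_file (filename or filehandle) or new_from_content (HTML string).
my $tb = HTML::TreeBuilder->new_from_content($html);   # $html holds your page source

# style => '' matches a div whose style attribute is the empty string;
# pass a regex such as qr/./ instead to match any non-empty style value.
# In scalar context look_down returns the first match, or undef if there is none.
my $div = $tb->look_down( _tag => 'div', style => '' );
print $div->as_text if $div;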
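
If the goal is a div whose only attribute is style, one possible approach is to pass a code ref to look_down and inspect the attribute names yourself. The following is a minimal sketch, not a definitive solution: the sample markup is an assumption modelled on the question, and all_external_attr_names is the HTML::Element method that lists the attributes actually present in the markup.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup (an assumption, modelled on the question): the outer <div>
# has both id and style, the inner ones have style only.
my $html = <<'HTML';
<div id="dataID" style="color: red;">
<div style="width: 250px;"><span style="float: left;">test1</span>test1_a</div>
<div style="width: 250px;"><span style="float: left;">test2</span>test2_a</div>
</div>
HTML

my $tb = HTML::TreeBuilder->new_from_content($html);

# The code ref is called with each candidate element; keep the element
# only when "style" is its sole external attribute.
my @divs = $tb->look_down(
    _tag => 'div',
    sub {
        my @attrs = $_[0]->all_external_attr_names;
        return @attrs == 1 && $attrs[0] eq 'style';
    },
);

# as_text concatenates all text below the element, nested <span> included.
my @texts = map { $_->as_text } @divs;
print "$_\n" for @texts;

$tb->delete;    # free the tree when done
```

In list context look_down returns every match, so the outer div (which also carries an id) is skipped and only the inner ones are printed.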

Evan Carroll 2010-07-15 03:46:54

Thanks. Yeah, I'm just kind of confused about which parser is the correct one to use, as there are several out there. I will look into that; thanks for taking the time to post :)

Rick 2010-07-15 03:52:21

HTML::TreeBuilder is the only one I use. It handles bad HTML extremely well and is substantially easier to use and faster to develop with. Token-based parsing (HTML::TokeParser) is, however, much faster at run time if your task is this simple -- but the speed probably doesn't matter.

Evan Carroll 2010-07-15 04:00:42

Yeah, speed doesn't matter. I agree about TokeParser; I think it's not good at handling bad HTML, which is why it was giving me problems on this. Thanks for your help with this, I am going to learn TreeBuilder inside and out now :)

Rick 2010-07-15 04:13:32

If you switch to HTML::TreeBuilder, have a look at HTML::TreeBuilder::XPath (http://search.cpan.org/dist/HTML-TreeBuilder-XPath/).

mirod 2010-07-15 05:37:05
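
For illustration, here is a rough sketch of what the HTML::TreeBuilder::XPath route might look like. The sample markup and the keys label and all are assumptions made up for this example; findnodes and findvalue are the query methods the module adds on top of the usual HTML::TreeBuilder tree.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Sample markup (an assumption, modelled on the question).
my $html = <<'HTML';
<div id="dataID" style="color: red;">
<div style="width: 250px;"><span style="float: left;">test1</span>test1_a</div>
<div style="width: 250px;"><span style="float: left;">test2</span>test2_a</div>
</div>
HTML

# new_from_content is inherited from HTML::TreeBuilder.
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Select the <div> children of the element with id="dataID".
my @rows;
for my $div ( $tree->findnodes('//div[@id="dataID"]/div') ) {
    push @rows, {
        label => $div->findvalue('./span'),    # text of the nested <span>
        all   => $div->as_text,                # all text inside the <div>
    };
}

printf "%s => %s\n", $_->{label}, $_->{all} for @rows;

$tree->delete;    # free the tree when done
```

The nodes returned by findnodes are ordinary HTML::Element objects, so methods such as as_text and look_down remain available on them.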

TokeParser and TreeBuilder use the same parsing engine under the hood -- if your TokeParser code is bad at handling bad HTML, it's only your fault. Anyway, the main use for TokeParser is when keeping the whole DOM in memory would be too expensive -- which is rarely the case.

hobbs 2010-07-15 17:45:58
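
For comparison, a token-based version that never builds a tree could look roughly like the sketch below. The sample markup is again an assumption modelled on the question; get_tag and get_trimmed_text are HTML::TokeParser methods, and the attribute check relies on the attribute-order array that get_tag returns for start tags.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup (an assumption, modelled on the question).
my $html = <<'HTML';
<div id="dataID" style="color: red;">
<div style="width: 250px;"><span style="float: left;">test1</span>test1_a</div>
<div style="width: 250px;"><span style="float: left;">test2</span>test2_a</div>
</div>
HTML

# A scalar reference tells TokeParser to parse the string itself.
my $p = HTML::TokeParser->new( \$html );

my @texts;
while ( my $tag = $p->get_tag('div') ) {
    # For a start tag, $tag is [ $tagname, \%attr, \@attrseq, $original_text ].
    my @names = @{ $tag->[2] };
    next unless @names == 1 && $names[0] eq 'style';

    # Collect the text up to the next </div>; nested tags such as <span>
    # are skipped and only their text is kept.
    push @texts, $p->get_trimmed_text('/div');
}

print "$_\n" for @texts;
```

This only holds one token at a time in memory, at the cost of tracking the document structure by hand; it would need more care if the target divs could themselves contain nested divs.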

Answer 2

+1 A:

Using Web::Scraper, try:

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper::Simple;
use Web::Scraper;

$Data::Dumper::Indent = 1;

my $html = '<div id="dataID" style="font-size: 8.5pt; width: 250px; color: rgb(0, 51, 102); margin-right$
<div style="width: 250px; text-align: right;"><span style="float: left;">test1</span>test1_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test2</span>test2_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test3</span>test3_a</div>';


my $proxyscraper = scraper {
    process '//div[@id="dataID"]/div', 'proxiesextracted[]' => scraper {
       process '//span', 'data1' => 'TEXT';
       process '//text()', 'data2' => 'TEXT';
     }
};

my $results = $proxyscraper->scrape( $html );

print Dumper($results);

It gives:

$results = {
  'proxiesextracted' => [
    {
      'data2' => 'test1_a',
      'data1' => 'test1'
    },
    {
      'data2' => 'test2_a',
      'data1' => 'test2'
    },
    {
      'data2' => 'test3_a',
      'data1' => 'test3'
    }
  ]
};

Hope this helps

bem33 2010-07-15 15:46:47
