I've been stuck on this all day. I'm still pretty new to parsing/scraping in Perl, but I thought I had it down until this. I have tried several Perl modules (HTML::TokeParser, HTML::TokeParser::Simple, Web::Scraper and some others). I have the following string (in reality it's an entire HTML page; this just shows the relevant part). I am trying to extract "test1" and "test1_a", and so on (the "test1", etc. are just placeholders). So basically I think I first need to extract this from each:

"<span style="float: left;">test1</span>test1_a"

Then parse this to get the 2 values. I don't know why this is giving me so much trouble, as I thought I could just do it in HTML::TokeParser::Simple, but I couldn't seem to return the value inside of the DIV. I wonder if it's because it contains another set of tags (the <span> tags).

String (represents the HTML web page):

<div id="dataID" style="font-size: 8.5pt; width: 250px; color: rgb(0, 51, 102); margin-right: 10px; float: right;">
<div style="width: 250px; text-align: right;"><span style="float: left;">test1</span>test1_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test2</span>test2_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test3</span>test3_a</div>

My attempt with the Web::Scraper module:

my $uri = URI->new($theurl);

my $proxyscraper = scraper {
    process 'div[style=~"width: 250px; text-align: right;"]',
        'proxiesextracted[]' => scraper {
            process '.style', style => 'TEXT';
        };
    result 'proxiesextracted';
};

I'm just kind of blindly trying to make sense of the Web::Scraper module, as there is essentially no documentation for it, so I pieced that together from the examples included with the module and one I found on the internet. Any advice is greatly appreciated.

+6  A: 

If you want a DOM parser (easier tree browsing, slightly slower), try HTML::TreeBuilder.

From the HTML::Element man page (the module is included):

Note also that look_down considers "" (empty-string) and undef to be different things, in attribute values. So this:

  $h->look_down("alt", "")

Which leads us to your answer:

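use HTML::TreeBuilder;

# Check the HTML::TreeBuilder POD: there are a few constructors
# (new_from_file for a file or filehandle, new_from_content for an HTML string).
# $html is assumed to hold the page source.
my $tb = HTML::TreeBuilder->new_from_content($html);

# Match the inner divs by the exact value of their style attribute
for my $div ( $tb->look_down( _tag => 'div', style => 'width: 250px; text-align: right;' ) ) {
    print $div->as_text, "\n";
}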
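
Note that as_text on one of the inner divs returns the span text and the trailing text run together (for example "test1test1_a"). A minimal sketch of one way to pull the two values apart, assuming the markup from the question; the variable names and the regex on the style attribute are illustrative choices, not part of the original answer:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup based on the question (outer div closed here for completeness)
my $html = <<'HTML';
<div id="dataID" style="font-size: 8.5pt; width: 250px; color: rgb(0, 51, 102); margin-right: 10px; float: right;">
<div style="width: 250px; text-align: right;"><span style="float: left;">test1</span>test1_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test2</span>test2_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test3</span>test3_a</div>
</div>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @pairs;
# look_down accepts a regex as an attribute value; only the inner divs
# have "text-align: right" in their style attribute
for my $div ( $tree->look_down( _tag => 'div', style => qr/text-align:\s*right/ ) ) {
    my $span  = $div->look_down( _tag => 'span' );
    my $label = $span ? $span->as_text : '';

    # content_list returns child elements (references) and text (plain strings);
    # keeping only the plain strings gives the text that follows the span
    my $value = join '', grep { !ref } $div->content_list;
    $value =~ s/^\s+|\s+$//g;

    push @pairs, [ $label, $value ];
}

print "$_->[0] => $_->[1]\n" for @pairs;

$tree->delete;    # free the tree when done
```

Each entry of @pairs holds the span text first and the text following the span second, so the two values stay separate instead of being concatenated.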
Evan Carroll
Thanks. Yeah, I'm just kind of confused as to which parser is the correct one to use, as there are different ones out there. I will look into that; thanks for taking the time to post :)
Rick
HTML::TreeBuilder is the only one I use. It handles bad HTML extremely well and is substantially easier to use and faster to develop with. Tokeparsing is, however, much faster at run time if your task is this simple, but the speed probably doesn't matter.
Evan Carroll
Yeah, speed doesn't matter. I agree about TokeParser; I think it's not good at handling bad HTML, which is why it was giving me problems on this. Thanks for your help with this, I am going to learn TreeBuilder inside and out now :)
Rick
If you switch to HTML::TreeBuilder, have a look at HTML::TreeBuilder::XPath (http://search.cpan.org/dist/HTML-TreeBuilder-XPath/)
mirod
TokeParser and TreeBuilder use the same parsing engine under the hood. If your TokeParser code is bad at handling bad HTML, it's only your fault. Anyway, the main use for TokeParser is when keeping the whole DOM in memory would be too expensive, which is rarely the case.
hobbs
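
For comparison, a token-based version of the same extraction might look like the sketch below. It assumes every <span> in the input is one of the label spans from the question, which would not hold for a full page; the variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup based on the question
my $html = <<'HTML';
<div style="width: 250px; text-align: right;"><span style="float: left;">test1</span>test1_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test2</span>test2_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test3</span>test3_a</div>
HTML

# A scalar reference is treated as the document content itself
my $p = HTML::TokeParser->new( \$html ) or die "Cannot create parser";

my @pairs;
while ( $p->get_tag('span') ) {
    my $label = $p->get_trimmed_text('/span');    # text inside the <span>
    $p->get_tag('/span');                         # step past the closing tag
    my $value = $p->get_trimmed_text('/div');     # text up to the closing </div>
    push @pairs, [ $label, $value ];
}

print "$_->[0] => $_->[1]\n" for @pairs;
```

Nothing is kept in memory beyond the current token, which is the trade-off mentioned in the comment above: less convenient navigation in exchange for a smaller footprint.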
+1  A: 

Using Web::Scraper, try:

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper::Simple;
use Web::Scraper;

$Data::Dumper::Indent = 1;

my $html = '<div id="dataID" style="font-size: 8.5pt; width: 250px; color: rgb(0, 51, 102); margin-right: 10px; float: right;">
<div style="width: 250px; text-align: right;"><span style="float: left;">test1</span>test1_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test2</span>test2_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test3</span>test3_a</div>';


my $proxyscraper = scraper {
    process '//div[@id="dataID"]/div', 'proxiesextracted[]' => scraper {
       process '//span', 'data1' => 'TEXT';
       process '//text()', 'data2' => 'TEXT';
     }
};

my $results = $proxyscraper->scrape( $html );

print Dumper($results);

It gives:

$results = {
  'proxiesextracted' => [
    {
      'data2' => 'test1_a',
      'data1' => 'test1'
    },
    {
      'data2' => 'test2_a',
      'data1' => 'test2'
    },
    {
      'data2' => 'test3_a',
      'data1' => 'test3'
    }
  ]
};

Hope this helps

bem33