I need to display some values that are stored on a website; to do that I need to scrape the site and fetch the content from a table. Any ideas?
If you are familiar with jQuery you might want to check out pQuery, which makes this very easy:
## print every <h2> tag in page
use pQuery;

pQuery("http://google.com/search?q=pquery")
    ->find("h2")
    ->each(sub {
        my $i = shift;
        print $i + 1, ") ", pQuery($_)->text, "\n";
    });
There's also HTML::DOM.
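For what it's worth, here is a minimal sketch of how HTML::DOM might be used for this kind of table extraction — the HTML string is just a placeholder, and I'm assuming the standard DOM Level 1 traversal methods the module provides:

    use strict;
    use warnings;
    use HTML::DOM;

    # Placeholder markup; in practice you'd feed in the fetched page source
    my $html = '<table><tr><td>foo</td><td>bar</td></tr></table>';

    my $dom = HTML::DOM->new;
    $dom->write($html);
    $dom->close;

    # Browser-style DOM traversal
    for my $row ($dom->getElementsByTagName('tr')) {
        my @cells = map { $_->firstChild->data }
                        $row->getElementsByTagName('td');
        print join(',', @cells), "\n";
    }

The appeal here is that if you already know the DOM API from JavaScript, you can carry that knowledge over directly.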
Whatever you do, though, don't use regular expressions to parse HTML.
I have used HTML::TableExtract in the past. I personally find it a bit clumsy to use, but maybe I did not understand the object model well. I usually adapt this example from the manual to examine the data:
use HTML::TableExtract;

my $te = HTML::TableExtract->new();
$te->parse($html_string);

# Examine all matching tables
foreach my $ts ($te->tables) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}
I use LWP::UserAgent for most of my screen-scraping needs. You can also couple it with HTTP::Cookies if you need cookie support.

Here's a simple example of how to fetch a page's source:
use LWP;
use HTTP::Cookies;

my $cookie_jar = HTTP::Cookies->new;
my $browser = LWP::UserAgent->new;
$browser->cookie_jar($cookie_jar);

my $resp = $browser->get("https://www.stackoverflow.com");
if ($resp->is_success) {
    # Play with your source here
    my $source = $resp->content;
    $source =~ s/^.*?<table>/<table>/si;  # this is just an example,
    print $source;                        # not a solution to your problem
}
Although I've generally done this with LWP/LWP::Simple, the current 'preferred' module for any sort of webpage scraping in Perl is WWW::Mechanize.
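Something like this is roughly what a WWW::Mechanize version looks like — the URL is a placeholder, and you'd typically hand the fetched content off to a proper parser such as HTML::TableExtract:

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new;
    $mech->get('http://example.com/report.html');
    die "Fetch failed: ", $mech->status unless $mech->success;

    # Mechanize keeps cookies, follows links, and fills in forms for you
    print $_->url, "\n" for $mech->links;

    my $html = $mech->content;   # feed this to your table parser

The win over plain LWP is that Mechanize maintains state (cookies, the current page, its links and forms) across requests, which matters as soon as the table you want sits behind a login or a search form.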
If you're familiar with XPath, you can also use HTML::TreeBuilder::XPath. And if you're not... well you should be ;--)
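To give a flavor of it, a small sketch of extracting table cells with HTML::TreeBuilder::XPath (the markup is a stand-in for the real page):

    use strict;
    use warnings;
    use HTML::TreeBuilder::XPath;

    my $html = '<table><tr><td>1</td><td>2</td></tr></table>';

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    # findvalues returns the text content of every matching node
    my @cells = $tree->findvalues('//table//td');
    print join(',', @cells), "\n";

    $tree->delete;   # HTML::TreeBuilder trees must be freed explicitly

One XPath expression replaces a whole nest of traversal loops, which is exactly why it's worth learning.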
For similar Stack Overflow questions, have a look at:
- How can I extract URLs from a web page in Perl
- How can I extract XML of a website and save in a file using Perl’s LWP?
I do like using pQuery for things like this; however, Web::Scraper does look interesting.
/I3az/
I don't mean to drag up a dead thread, but anyone googling across this thread should also check out WWW::Scripter - 'For scripting web sites that have scripts'.
happy remote data aggregating ;)
Take a look at the magical Web::Scraper, it's THE tool for web scraping.
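A minimal Web::Scraper sketch for the table case, using CSS selectors (the URL is a placeholder for whatever page holds your table):

    use strict;
    use warnings;
    use Web::Scraper;
    use URI;

    # Describe the structure declaratively: every row, every cell's text
    my $table = scraper {
        process 'table tr', 'rows[]' => scraper {
            process 'td', 'cells[]' => 'TEXT';
        };
    };

    my $res = $table->scrape( URI->new('http://example.com/data.html') );

    for my $row (@{ $res->{rows} || [] }) {
        print join(',', @{ $row->{cells} || [] }), "\n";
    }

The nice part is that the scraper is a declarative description of what you want, not a sequence of traversal steps, so it tends to survive small changes to the page layout.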