
I need to display some values that are stored on a website; to do that, I need to scrape the site and extract the content from a table. Any ideas?

+5  A: 

If you are familiar with jQuery you might want to check out pQuery, which makes this very easy:

## print every <h2> tag in page
use pQuery;

pQuery("http://google.com/search?q=pquery")
    ->find("h2")
    ->each(sub {
        my $i = shift;
        print $i + 1, ") ", pQuery($_)->text, "\n";
    });

There's also HTML::DOM.

Whatever you do, though, don't use regular expressions for this.

Paolo Bergantino
+4  A: 

I have used HTML::TableExtract in the past. I personally find it a bit clumsy to use, but maybe I did not understand the object model well. I usually start from this example in the manual to examine the data:

 use HTML::TableExtract;
 my $te = HTML::TableExtract->new();
 $te->parse($html_string);

 # Examine all matching tables
 foreach my $ts ($te->tables) {
     print "Table (", join(',', $ts->coords), "):\n";
     foreach my $row ($ts->rows) {
         print join(',', @$row), "\n";
     }
 }
weismat
HTML::TableExtract is quite magical. One great feature is being able to select tables by specifying the contents of header cells and being able to keep only the columns you are interested in.
Sinan Ünür
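To illustrate the header-based selection Sinan mentions, a minimal sketch looks like this (the header names and `$html_string` are assumptions for the example, not from the question):

```perl
use strict;
use warnings;
use HTML::TableExtract;

my $html_string = do { local $/; <DATA> };   # read sample HTML below

# Only match tables whose header row contains these columns,
# and return only those columns, in this order.
my $te = HTML::TableExtract->new( headers => [ 'Name', 'Price' ] );
$te->parse($html_string);

foreach my $ts ( $te->tables ) {
    foreach my $row ( $ts->rows ) {
        print join( ' | ', @$row ), "\n";
    }
}

__DATA__
<table>
<tr><th>Name</th><th>Color</th><th>Price</th></tr>
<tr><td>Widget</td><td>red</td><td>1.50</td></tr>
</table>
```

The `headers` option also skips any tables that don't match, so you don't have to locate the right table by coordinates yourself.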
A: 

I use LWP::UserAgent for most of my screen-scraping needs. You can also couple it with HTTP::Cookies if you need cookie support.

Here's a simple example of how to fetch a page's source:

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

my $cookie_jar = HTTP::Cookies->new;
my $browser = LWP::UserAgent->new;
$browser->cookie_jar($cookie_jar);

my $resp = $browser->get("https://www.stackoverflow.com");
if ($resp->is_success) {
   # Play with your source here
   my $source = $resp->content;
   $source =~ s/^.*?<table>/<table>/is; # this is just an example,
   print $source;                       # not a solution to your problem.
}
J.J.
+4  A: 

Although I've generally done this with LWP/LWP::Simple, the current 'preferred' module for any sort of webpage scraping in Perl is WWW::Mechanize.
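A minimal WWW::Mechanize fetch looks roughly like this (the URL is a placeholder, not from the question):

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );   # die on HTTP errors
$mech->get('http://example.com/table-page');        # placeholder URL

# Mechanize keeps state (cookies, current page) between calls,
# so following links or submitting forms is a single call each:
# $mech->follow_link( text_regex => qr/next/i );
# $mech->submit_form( fields => { q => 'tables' } );

print $mech->content;   # raw HTML, ready for a parser such as HTML::TableExtract
```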

Dave Sherohman
David: Can you expand on this. I always thought WWW::Mechanize was more for automated testing. What puts it a cut above?
J.J.
WWW::Mechanize is for any sort of interaction with a website. It was never targeted just at automated testing.
brian d foy
However, Test::WWW::Mechanize *is* targeted just at automated testing. It is a wrapper around WWW::Mechanize.
Andy Lester
+2  A: 

If you're familiar with XPath, you can also use HTML::TreeBuilder::XPath. And if you're not... well you should be ;--)
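For a flavor of what that looks like, here is a small sketch (the inline HTML is made up for the example):

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Parse a string of HTML (sample markup, just for the sketch)
my $tree = HTML::TreeBuilder::XPath->new_from_content(
    '<table><tr><td>foo</td><td>bar</td></tr></table>'
);

# findvalues returns the text content of every matching node
my @cells = $tree->findvalues('//table/tr/td');
print join( ', ', @cells ), "\n";

$tree->delete;   # free the parse tree when done
```

The same `findnodes`/`findvalues` calls work against a page fetched with LWP; XPath expressions like `//table[2]/tr[position() > 1]` make it easy to skip header rows.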

mirod
+2  A: 

For similar Stack Overflow questions, have a look at...

I do like using pQuery for things like this; however, Web::Scraper does look interesting.

/I3az/

draegtun
+1  A: 

I don't mean to drag up a dead thread, but anyone Googling across this thread should also check out WWW::Scripter - 'For scripting web sites that have scripts'.

happy remote data aggregating ;)

mr.szgz
+1  A: 

Take a look at the magical Web::Scraper, it's THE tool for web scraping.
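A Web::Scraper sketch for pulling table cells might look like this (the URL and CSS selectors are placeholders for the example):

```perl
use strict;
use warnings;
use Web::Scraper;
use URI;

# Declare what to extract: the text of every cell in each table row
my $table = scraper {
    process 'table tr', 'rows[]' => scraper {
        process 'td', 'cells[]' => 'TEXT';
    };
};

# scrape() accepts a URI (it fetches the page for you) or a string of HTML
my $res = $table->scrape( URI->new('http://example.com/table-page') );

for my $row ( @{ $res->{rows} || [] } ) {
    print join( ', ', @{ $row->{cells} || [] } ), "\n";
}
```

The declarative style keeps the "what to extract" separate from the fetching and looping, which is what people tend to like about it.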

bem33
A: 

Check out this little example of web scraping with Perl: link text

juFo