I need to display some values that are stored on a website; to do that I need to scrape the site and fetch the content from a table. Any ideas?
If you are familiar with jQuery you might want to check out pQuery, which makes this very easy:
## print every <h2> tag in page
use pQuery;

pQuery("http://google.com/search?q=pquery")
    ->find("h2")
    ->each(sub {
        my $i = shift;
        print $i + 1, ") ", pQuery($_)->text, "\n";
    });
There's also HTML::DOM.
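For what it's worth, here is a minimal sketch of how HTML::DOM might be used for this kind of table extraction — the HTML string is just a placeholder, and I'm assuming the standard DOM Level 1 traversal methods the module provides:

    use strict;
    use warnings;
    use HTML::DOM;

    # Placeholder markup; in practice you'd feed in the fetched page source
    my $html = '<table><tr><td>foo</td><td>bar</td></tr></table>';

    my $dom = HTML::DOM->new;
    $dom->write($html);
    $dom->close;

    # Browser-style DOM traversal
    for my $row ($dom->getElementsByTagName('tr')) {
        my @cells = map { $_->firstChild->data }
                        $row->getElementsByTagName('td');
        print join(',', @cells), "\n";
    }

The appeal here is that if you already know the DOM API from JavaScript, you can carry that knowledge over directly.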
Whatever you do, though, don't use regular expressions to parse HTML.
I have used HTML::TableExtract in the past. I personally find it a bit clumsy to use, but maybe I did not understand the object model well. I usually adapt this example from the manual to examine the data:
use HTML::TableExtract;

my $te = HTML::TableExtract->new();
$te->parse($html_string);

# Examine all matching tables
foreach my $ts ($te->tables) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        print join(',', @$row), "\n";
    }
}
I use LWP::UserAgent for most of my screen-scraping needs. You can also couple it with HTTP::Cookies if you need cookie support.

Here's a simple example of how to fetch a page's source:
use LWP;
use HTTP::Cookies;

my $cookie_jar = HTTP::Cookies->new;
my $browser = LWP::UserAgent->new;
$browser->cookie_jar($cookie_jar);

my $resp = $browser->get("https://www.stackoverflow.com");
if ($resp->is_success) {
    # Play with your source here
    my $source = $resp->content;
    $source =~ s/^.*?<table>/<table>/si;  # this is just an example,
    print $source;                        # not a solution to your problem
}
Although I've generally done this with LWP/LWP::Simple, the current 'preferred' module for any sort of webpage scraping in Perl is WWW::Mechanize.
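Something like this is roughly what a WWW::Mechanize version looks like — the URL is a placeholder, and you'd typically hand the fetched content off to a proper parser such as HTML::TableExtract:

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new;
    $mech->get('http://example.com/report.html');
    die "Fetch failed: ", $mech->status unless $mech->success;

    # Mechanize keeps cookies, follows links, and fills in forms for you
    print $_->url, "\n" for $mech->links;

    my $html = $mech->content;   # feed this to your table parser

The win over plain LWP is that Mechanize maintains state (cookies, the current page, its links and forms) across requests, which matters as soon as the table you want sits behind a login or a search form.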
If you're familiar with XPath, you can also use HTML::TreeBuilder::XPath. And if you're not... well you should be ;--)
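To give a flavor of it, a small sketch of extracting table cells with HTML::TreeBuilder::XPath (the markup is a stand-in for the real page):

    use strict;
    use warnings;
    use HTML::TreeBuilder::XPath;

    my $html = '<table><tr><td>1</td><td>2</td></tr></table>';

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    # findvalues returns the text content of every matching node
    my @cells = $tree->findvalues('//table//td');
    print join(',', @cells), "\n";

    $tree->delete;   # HTML::TreeBuilder trees must be freed explicitly

One XPath expression replaces a whole nest of traversal loops, which is exactly why it's worth learning.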
For similar Stack Overflow questions, have a look at:
- How can I extract URLs from a web page in Perl
- How can I extract XML of a website and save in a file using Perl’s LWP?
I do like using pQuery for things like this; however, Web::Scraper does look interesting.
/I3az/
I don't mean to drag up a dead thread, but anyone googling across this thread should also check out WWW::Scripter - 'For scripting web sites that have scripts'.
happy remote data aggregating ;)
Take a look at the magical Web::Scraper, it's THE tool for web scraping.
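A minimal Web::Scraper sketch for the table case, using CSS selectors (the URL is a placeholder for whatever page holds your table):

    use strict;
    use warnings;
    use Web::Scraper;
    use URI;

    # Describe the structure declaratively: every row, every cell's text
    my $table = scraper {
        process 'table tr', 'rows[]' => scraper {
            process 'td', 'cells[]' => 'TEXT';
        };
    };

    my $res = $table->scrape( URI->new('http://example.com/data.html') );

    for my $row (@{ $res->{rows} || [] }) {
        print join(',', @{ $row->{cells} || [] }), "\n";
    }

The nice part is that the scraper is a declarative description of what you want, not a sequence of traversal steps, so it tends to survive small changes to the page layout.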