What built-in PHP functions are useful for web scraping? What are some good resources (web or print) for getting up to speed on web scraping with PHP?
The curl library allows you to download web pages. You should look into regular expressions for doing the scraping.
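To make that concrete, here's a minimal sketch of the cURL-plus-regex approach. The function names fetchPage and extractTitle, the 10-second timeout, and the user agent string are all made up for illustration; the curl_* and preg_match calls are the standard ones from the cURL and PCRE extensions.

```php
<?php
// Hypothetical helper: download a page with the cURL extension.
// Returns the response body, or null if the transfer failed.
function fetchPage($url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_TIMEOUT        => 10,    // assumed limit: give up after 10 seconds
        CURLOPT_USERAGENT      => 'ExampleScraper/0.1', // assumed user agent
    ));
    $html = curl_exec($ch);
    return ($html === false) ? null : $html;
}

// Hypothetical helper: pull the <title> text out of raw HTML with a regex.
// Returns null when no <title> element is found.
function extractTitle($html)
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim(html_entity_decode($m[1]));
    }
    return null;
}
```

You'd use it as extractTitle(fetchPage($url)), after checking that fetchPage didn't return null. A regex is fine for one well-defined field like the title, but it gets brittle quickly on nested markup.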
file_get_contents() can take a remote URL and give you the source, provided allow_url_fopen is enabled in php.ini. You can then use regular expressions (with the Perl-compatible preg_* functions) to grab what you need.
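As a rough sketch of that idea: fetch the source with file_get_contents() and collect every link target with preg_match_all(). The function names fetchSource and extractLinks and the 10-second timeout are assumptions for the example, and the pattern only handles double-quoted href attributes.

```php
<?php
// Hypothetical helper: fetch a remote page with file_get_contents().
// Needs allow_url_fopen enabled. Returns null on failure.
function fetchSource($url)
{
    // Assumed timeout of 10 seconds for the HTTP request.
    $context = stream_context_create(array('http' => array('timeout' => 10)));
    $html = @file_get_contents($url, false, $context);
    return ($html === false) ? null : $html;
}

// Hypothetical helper: return the href value of every <a> tag in $html.
// Only double-quoted attributes are matched.
function extractLinks($html)
{
    preg_match_all('/<a\s[^>]*href="([^"]*)"/i', $html, $matches);
    return $matches[1]; // capture group 1 holds the href values
}
```

Calling extractLinks(fetchSource($url)) gives you a flat array of URLs once you've checked that fetchSource didn't return null. Relative links come back exactly as written in the page, so you'd still need to resolve them against the base URL yourself.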
Out of curiosity, what are you trying to scrape?
Here's an OK tutorial (link removed, see below) on web scraping using cURL and file_get_contents. Be sure to read the next few parts as well.
(direct hyperlink removed due to malware warnings)
http://www.oooff.com/php-scripts/basic-php-scraped-data-parsing/basic-php-data-parsing.php
I'd either use libcurl or Perl's LWP (libwww for perl). Is there a libwww for php?
There is a book, "Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL", on this topic - see a review here
PHP-Architect covered it in a well-written article by Matthew Turland in the December 2007 issue.
@Brian Warshaw: I'm actually looking to scrape BibleGateway.com as they don't provide an API to access verses for a web app I'm looking to create.
I've been developing a scraper for StackOverflow so that we can track what changes affected our reputation score. It's quite hackish, but it works:
http://modos.org/sof/?source=1
That should give you an idea of what it takes (cURL plus regular expressions) to parse a page.
Christopher, what are you talking about? He is simply asking about how to implement a web scraper. There was nothing in his comment to warrant those sorts of assumptions.
If you need something that is easy to maintain, rather than fast to execute, it could help to use a scriptable browser, such as SimpleTest's.
Thanks to crono for kindly referring to my php|architect article. :) I am actually in the process of writing a small book on the subject of web scraping with PHP. It will be published through php|architect and hopefully available before Q3 2009. In the meantime, you can check out my blog at http://ishouldbecoding.com for the occasional post regarding web scraping.
I'd like to recommend this class I recently came across: Simple HTML DOM Parser.
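The appeal of a DOM parser over regular expressions is that you query the document tree instead of the raw text. For comparison, here's a sketch of the same idea using PHP's bundled DOM extension (DOMDocument and DOMXPath). Note that this is not the Simple HTML DOM Parser API, just a built-in alternative, and the function name extractLinkTexts is made up for the example.

```php
<?php
// Hypothetical helper: map each link's href to its visible text,
// using the built-in DOM extension rather than regular expressions.
function extractLinkTexts($html)
{
    if ($html === '') {
        return array(); // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings
    // internally instead of printing them, then discard them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = array();
    // Select every <a> element that has an href attribute.
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[$a->getAttribute('href')] = trim($a->textContent);
    }
    return $links;
}
```

Because the parser builds a tree, attribute order, quoting style, and nested tags inside the link text don't matter, which is where regex-based scrapers usually break.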
I'm actually looking to scrape BibleGateway.com as they don't provide an API to access verses for a web app I'm looking to create.
It sounds like you may be trying to 'hotlink' rather than scrape, i.e. update in realtime based on their site content?
This tutorial is quite good:
http://www.merchantos.com/makebeta/php/scraping-links-with-php/
You might also want to look at Prowser.
Scraper class from my framework:
<?php
/*
Example:
    $site = $this->load->cls('scraper', 'http://www.anysite.com');
    $excss = $site->getExternalCSS();
    $incss = $site->getInternalCSS();
    $ids = $site->getIds();
    $classes = $site->getClasses();
    $spans = $site->getSpans();
    print '<pre>';
    print_r($excss);
    print_r($incss);
    print_r($ids);
    print_r($classes);
    print_r($spans);

Each getter returns array(list of matches, number of matches).
*/
class scraper
{
    // Raw HTML of the fetched page (empty string if the download failed).
    private $html = '';

    public function __construct($url)
    {
        // Remote URLs require allow_url_fopen to be enabled.
        $html = file_get_contents($url);
        $this->html = ($html === false) ? '' : $html;
    }

    // Run $pattern over the page; return capture group 2 and its match count.
    private function matchAll($pattern)
    {
        preg_match_all($pattern, $this->html, $patterns);
        return array($patterns[2], count($patterns[2]));
    }

    // Contents of inline style="..." attributes.
    public function getInternalCSS()
    {
        return $this->matchAll('/(style=")(.*?)(")/is');
    }

    // href values ending in .css; [^"]* keeps each match inside one attribute.
    public function getExternalCSS()
    {
        return $this->matchAll('/(href=")([^"]*\.css)"/i');
    }

    // id attribute values (hyphens and other non-word characters included).
    public function getIds()
    {
        return $this->matchAll('/(\sid=")([^"]*)"/is');
    }

    // class attribute values; multiple class names come back as one string.
    public function getClasses()
    {
        return $this->matchAll('/(\sclass=")([^"]*)"/is');
    }

    // Contents of <span> elements, matched non-greedily and across newlines.
    public function getSpans()
    {
        return $this->matchAll('/(<span\b[^>]*>)(.*?)(<\/span>)/is');
    }
}
?>
I agree with crono. That book was the first one I read on web scraping, and my personal review of it is on my website: Wade Cybertech - your spider, crawler, scraper resource center
I recommend using ScrapePro Web Scraper Designer: http://www.scrapepro.com
ScraperWiki is a pretty interesting project. It helps you build scrapers online in Python, Ruby or PHP - I was able to get a simple attempt up in a few minutes.