What built-in PHP functions are useful for web scraping? What are some good resources (web or print) for getting up to speed on web scraping with PHP?

+1  A: 

The curl library allows you to download web pages. You should look into regular expressions for doing the scraping.
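
A minimal sketch of that combination, assuming the cURL extension is enabled. The helper names `fetchPage` and `extractTitle` and the user-agent string are made up for illustration; they are not part of any library.

```php
<?php
// Hypothetical helper: download a page with the cURL extension.
// Returns the response body, or null if the transfer failed.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,             // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,             // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds,  // give up after this many seconds
        CURLOPT_USERAGENT      => 'ExampleScraper/0.1', // placeholder value
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Hypothetical helper: pull the <title> text out of an HTML string with a regex.
function extractTitle(string $html): ?string
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim(html_entity_decode($m[1]));
    }
    return null;
}
```

Something like `extractTitle(fetchPage('http://www.example.com') ?? '')` would tie the two together. A regex is adequate for a single well-known tag like this, but it gets fragile quickly on nested markup.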

Peter Stuifzand
A: 

file_get_contents() can take a remote URL and give you the source. You can then use regular expressions (with the Perl-compatible functions) to grab what you need.
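
Roughly what that could look like, as a sketch only. `extractLinks` and `scrapeLinks` are hypothetical names, the 10-second timeout is an arbitrary choice, and fetching a remote URL this way assumes `allow_url_fopen` is enabled.

```php
<?php
// Hypothetical helper: collect the href values of <a> tags with a PCRE pattern.
// Handles single- or double-quoted attribute values.
function extractLinks(string $html): array
{
    preg_match_all('/<a\s[^>]*href\s*=\s*(["\'])(.*?)\1/is', $html, $matches);
    return $matches[2];
}

// Hypothetical helper: read a URL (or local path) and return the links found in it.
function scrapeLinks(string $source): array
{
    // The 'http' options only apply to http:// sources; the timeout is an assumed value.
    $context = stream_context_create(['http' => ['timeout' => 10]]);
    $html = @file_get_contents($source, false, $context);
    return $html === false ? [] : extractLinks($html);
}
```

Calling `scrapeLinks('http://www.example.com/')` would return an array of href strings, or an empty array if the page could not be read.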

Out of curiosity, what are you trying to scrape?

Brian Warshaw
+2  A: 

Here's an OK tutorial (link removed, see below) on web scraping using cURL and file_get_contents. Be sure to read the next few parts as well.

(direct hyperlink removed due to malware warnings)

http://www.oooff.com/php-scripts/basic-php-scraped-data-parsing/basic-php-data-parsing.php

Ross
A: 

I'd use either libcurl or Perl's LWP (libwww for Perl). Is there a libwww for PHP?

dlamblin
If you're going to use LWP, use WWW::Mechanize, which wraps it with handy helper functions.
Andy Lester
+7  A: 

There is a book on this topic, "Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL" - see a review here

php|architect also covered it in a well-written article by Matthew Turland in the December 2007 issue.

crono
A: 

@Brian Warshaw: I'm actually looking to scrape BibleGateway.com as they don't provide an API to access verses for a web app I'm looking to create.

Chaz Lever
+17  A: 
tyshock
This is great, just what I've been looking for. Do you have any other breakdown or improvements to the class?
Phill Pafford
Mm, parsing html with regexes is... well, I'll just let this guy explain: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
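
For comparison, here is a minimal sketch of the parser-based alternative using PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`) instead of regexes. The helper name `queryText` is hypothetical, and this is not the class from the answer above.

```php
<?php
// Hypothetical helper: return the trimmed text of every node matching an XPath query.
function queryText(string $html, string $xpathQuery): array
{
    if ($html === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($xpathQuery);
    if ($nodes === false) {
        return []; // the XPath expression was invalid
    }
    $results = [];
    foreach ($nodes as $node) {
        $results[] = trim($node->textContent);
    }
    return $results;
}
```

For example, `queryText($html, '//span[@class="verse"]')` would return the text of each matching span, including spans that carry nested tags, which a flat regex tends to get wrong.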
Xiong Chiamiov
+1  A: 

I've been developing a scraper for StackOverflow so that we can track what changes affected our reputation score. It's quite hackish, but it works:

http://modos.org/sof/?source=1

That should give you an idea of what it takes (cURL/regular expressions) to parse a page.

Kyle Cronin
A: 

Christopher, what are you talking about? He is simply asking about how to implement a web scraper. There was nothing in his comment to warrant those sorts of assumptions.

gaoshan88
I'm still waiting for the "legitimate reuse" bit in my earlier comment. That the questioner runs an internet-based marketing company does not inspire confidence.
Christopher Mahan
A: 

If you need something that is easy to maintain, rather than fast to execute, it could help to use a scriptable browser, such as SimpleTest's.

troelskn
A: 

Thanks to crono for kindly referring to my php|architect article. :) I am actually in the process of writing a small book on the subject of web scraping with PHP. It will be published through php|architect and hopefully available before Q3 2009. In the meantime, you can check out my blog at http://ishouldbecoding.com for the occasional post regarding web scraping.

+6  A: 

I'd like to recommend a class I recently came across: Simple HTML DOM Parser.

SoulBlighter
+1, jQuery-like selectors, and totally awesome.
Daniel
A: 

I'm actually looking to scrape BibleGateway.com as they don't provide an API to access verses for a web app I'm looking to create.

It sounds like you may be trying to 'hotlink' rather than scrape, i.e. update in realtime based on their site content?

This tutorial is quite good:

http://www.merchantos.com/makebeta/php/scraping-links-with-php/

You might also want to look at Prowser.

Aaron Newton
A: 

Scraper class from my framework:

<?php

/*
 Example:

 $site = $this->load->cls('scraper', 'http://www.anysite.com');
 $excss = $site->getExternalCSS();
 $incss = $site->getInternalCSS();
 $ids = $site->getIds();
 $classes = $site->getClasses();
 $spans = $site->getSpans(); 

 print '<pre>';
 print_r($excss);
 print_r($incss);
 print_r($ids);
 print_r($classes);
 print_r($spans);  

*/

class scraper
{
 // Raw HTML of the fetched page (empty if the download failed).
 private $html = '';

 public function __construct($url)
 {
  // file_get_contents() returns false on failure, so fall back to an empty string.
  $html = file_get_contents($url);
  $this->html = ($html === false) ? '' : $html;
 }

 // Runs $pattern against the page and returns array(group-2 captures, their count).
 private function matchAll($pattern)
 {
  preg_match_all($pattern, $this->html, $patterns);
  return array($patterns[2], count($patterns[2]));
 }

 // Values of inline style="..." attributes.
 public function getInternalCSS()
 {
  return $this->matchAll('/(style=")(.*?)(")/is');
 }

 // href values ending in .css; [^"]* keeps each match inside a single attribute.
 public function getExternalCSS()
 {
  return $this->matchAll('/(href=")(\w[^"]*\.css)"/i');
 }

 // Values of id="..." attributes.
 public function getIds()
 {
  return $this->matchAll('/(id="(\w*)")/is');
 }

 // Values of class="..." attributes (single-word values only, because of \w*).
 public function getClasses()
 {
  return $this->matchAll('/(class="(\w*)")/is');
 }

 // Contents of attribute-less <span> elements; .*? stops at the first closing tag.
 public function getSpans()
 {
  return $this->matchAll('/(<span>)(.*?)(<\/span>)/s');
 }

}
?>
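
One detail that matters for patterns like the `<span>` one: whether the quantifier is greedy (`.*`) or lazy (`.*?`). The sketch below is a standalone illustration, with `spanBodies` as a hypothetical helper that is not part of the class above.

```php
<?php
// Hypothetical helper: return the inner text of each <span>, using either a
// lazy (.*?) or a greedy (.*) quantifier so the two can be compared.
function spanBodies(string $html, bool $lazy): array
{
    $pattern = $lazy ? '/<span>(.*?)<\/span>/s' : '/<span>(.*)<\/span>/s';
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}

$html = '<span>a</span><span>b</span>';
print_r(spanBodies($html, true));  // two entries: "a" and "b"
print_r(spanBodies($html, false)); // one entry: "a</span><span>b"
```

The greedy version runs from the first opening tag to the last closing tag, so two adjacent spans collapse into a single match.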
Sarfraz
A: 

I agree with crono. That book was the first one I read on web scraping, and my personal review of it is on my website: Wade Cybertech - your spider, crawer, scraper resource center

detectedstealth
A: 

Here is another one: a simple PHP scraper without regex.

php html
A: 

I recommend using ScrapePro Web Scraper Designer: http://www.scrapepro.com

csharpp
A: 

ScraperWiki is a pretty interesting project. It helps you build scrapers online in Python, Ruby or PHP. I was able to get a simple attempt up in a few minutes.

ZhanZhuang