ansaurus

Question

Answer 1

+1 A:

The curl library allows you to download web pages. You should look into regular expressions for doing the scraping.
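A minimal sketch of that approach, assuming PHP's cURL extension is installed. The helper names fetch_page and extract_title are hypothetical, and the demonstration runs on an inline sample rather than a live site:

```php
<?php
// Hypothetical helper: download a page with the cURL extension.
function fetch_page(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $body;
}

// Hypothetical helper: pull the <title> text out of an HTML string with a regex.
function extract_title(string $html): ?string
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim(html_entity_decode($m[1]));
    }
    return null;
}

// Offline demonstration on an inline sample (no network access needed)
$sample = '<html><head><title> Example Page </title></head><body></body></html>';
$title = extract_title($sample);
echo $title, "\n";
```

On a live site you would pass the result of fetch_page($url) to extract_title() instead of the inline sample.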

Peter Stuifzand 2008-08-25 21:30:01

Answer 2

A:

file_get_contents() can take a remote URL and give you the source. You can then use regular expressions (with the Perl-compatible functions) to grab what you need.
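As a rough illustration, assuming allow_url_fopen is enabled for the remote fetch, the pattern might look like the following. The helper name extract_links and the sample markup are made up for the example:

```php
<?php
// Hypothetical helper: collect the href values of <a> tags with a PCRE pattern.
function extract_links(string $html): array
{
    preg_match_all('/<a\s[^>]*href="([^"]*)"/i', $html, $matches);
    return $matches[1]; // first capture group: the URL inside href="..."
}

// Fetching a live page would look like this (needs allow_url_fopen enabled):
// $html = file_get_contents('http://www.example.com/');
$html = '<p><a href="/one">One</a> and <a class="x" href="http://example.com/two">Two</a></p>';

$links = extract_links($html);
print_r($links);
```

The pattern only handles double-quoted href attributes, so treat it as a starting point rather than a general link extractor.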

Out of curiosity, what are you trying to scrape?

Brian Warshaw 2008-08-25 21:31:03

Answer 3

+2 A:

Here's an OK tutorial (link removed, see below) on web scraping using cURL and file_get_contents. Be sure to read the next few parts as well.

(direct hyperlink removed due to malware warnings)

http://www.oooff.com/php-scripts/basic-php-scraped-data-parsing/basic-php-data-parsing.php

Ross 2008-08-25 21:34:02

Answer 4

A:

I'd either use libcurl or Perl's LWP (libwww for Perl). Is there a libwww for PHP?

dlamblin 2008-08-25 21:39:43

If you're going to use LWP, use WWW::Mechanize, which wraps it with handy helper functions.

Andy Lester 2008-09-24 05:33:47

Answer 5

+7 A:

There is a book on this topic, "Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL"; see a review here.

php|architect also covered the topic in a well-written article by Matthew Turland in the December 2007 issue.

crono 2008-08-25 23:21:53

Answer 6

A:

@Brian Warshaw: I'm actually looking to scrape BibleGateway.com as they don't provide an API to access verses for a web app I'm looking to create.

Chaz Lever 2008-08-28 20:10:34

Answer 7

+17 A:

tyshock 2008-09-19 16:40:07

This is great, just what I've been looking for. Do you have any other breakdown or improvements to the class?

Phill Pafford 2009-10-12 15:43:53

Mm, parsing html with regexes is... well, I'll just let this guy explain: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

Xiong Chiamiov 2010-07-20 07:03:34

Answer 8

+1 A:

I've been developing a scraper for StackOverflow so that we can track what changes affected our reputation score. It's quite hackish, but it works:

http://modos.org/sof/?source=1

That should give you an idea of what it takes (CURL/regular expressions) to parse a page.

Kyle Cronin 2008-09-19 16:47:32

Answer 9

A:

Christopher, what are you talking about? He is simply asking about how to implement a web scraper. There was nothing in his comment to warrant those sorts of assumptions.

gaoshan88 2008-09-19 19:18:42

I'm still waiting for the "legitimate reuse" bit in my earlier comment. That the questioner runs an internet-based marketing company does not inspire confidence.

Christopher Mahan 2008-09-29 16:47:00

Answer 10

A:

If you need something that is easy to maintain, rather than fast to execute, it could help to use a scriptable browser, such as SimpleTest's.

troelskn 2008-09-19 21:49:25

Answer 11

A:

Thanks to crono for kindly referring to my php|architect article. :) I am actually in the process of writing a small book on the subject of web scraping with PHP. It will be published through php|architect and will hopefully be available before Q3 2009. In the meantime, you can check out my blog at http://ishouldbecoding.com for the occasional post regarding web scraping.

2009-01-03 00:10:25

Answer 12

+6 A:

I'd like to recommend this class I recently came across: Simple HTML DOM Parser.
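Simple HTML DOM Parser is a third-party class, so the sketch below does not use its API. It shows the same DOM-based idea with PHP's built-in DOMDocument and DOMXPath instead; the helper name query_text and the "verse" class in the sample markup are hypothetical:

```php
<?php
// Hypothetical helper: return the trimmed text of every node matching an XPath query.
function query_text(string $html, string $xpath): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parse warnings instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $results = [];
    foreach ((new DOMXPath($doc))->query($xpath) as $node) {
        $results[] = trim($node->textContent);
    }
    return $results;
}

// Inline sample markup (made up for the example)
$html = '<html><body><div class="verse"><span>First</span></div>'
      . '<div class="note">Skip me</div><div class="verse">Second</div></body></html>';

$verses = query_text($html, '//div[@class="verse"]');
print_r($verses);
```

Because the page is parsed into a tree, nested tags and attribute order do not break the query the way they can break a regular expression.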

SoulBlighter 2009-04-21 07:43:48

+1, jQuery-like selectors, and totally awesome.

Daniel 2009-12-26 06:30:49

Answer 13

A:

I'm actually looking to scrape BibleGateway.com as they don't provide an API to access verses for a web app I'm looking to create.

It sounds like you may be trying to 'hotlink' rather than scrape, i.e., update in real time based on their site content?

This tutorial is quite good:

http://www.merchantos.com/makebeta/php/scraping-links-with-php/

You might also want to look at Prowser.

Aaron Newton 2009-12-23 07:40:31

Answer 14

A:

Scraper class from my framework:

<?php

/*
 Example (uses the framework's own loader):

 $site = $this->load->cls('scraper', 'http://www.anysite.com');
 $excss = $site->getExternalCSS();
 $incss = $site->getInternalCSS();
 $ids = $site->getIds();
 $classes = $site->getClasses();
 $spans = $site->getSpans();

 print '<pre>';
 print_r($excss);
 print_r($incss);
 print_r($ids);
 print_r($classes);
 print_r($spans);
 print '</pre>';

*/

class scraper
{
 // Raw HTML of the fetched page
 private $html = '';

 public function __construct($url)
 {
  $html = file_get_contents($url);
  if ($html === false) {
   throw new RuntimeException("Could not fetch $url");
  }
  $this->html = $html;
 }

 // Runs $pattern against the page and returns array(matches of $group, match count)
 private function matchAll($pattern, $group)
 {
  preg_match_all($pattern, $this->html, $patterns);
  return array($patterns[$group], count($patterns[$group]));
 }

 // Contents of inline style="..." attributes
 public function getInternalCSS()
 {
  return $this->matchAll('/(style=")(.*?)(")/is', 2);
 }

 // Stylesheet paths from href="....css" (stops at the closing quote)
 public function getExternalCSS()
 {
  return $this->matchAll('/(href=")([^"]*\.css)"/i', 2);
 }

 // Values of id="..." attributes, hyphens allowed
 public function getIds()
 {
  return $this->matchAll('/(id="([\w-]*)")/is', 2);
 }

 // Values of class="..." attributes, including space-separated class lists
 public function getClasses()
 {
  return $this->matchAll('/(class="([\w\s-]*)")/is', 2);
 }

 // Contents of <span> elements; non-greedy so each span is matched separately
 public function getSpans()
 {
  return $this->matchAll('/(<span[^>]*>)(.*?)(<\/span>)/is', 2);
 }

}
?>

Sarfraz 2009-12-26 06:19:02

Answer 15

A:

I agree with crono. That book was the first one I read on web scraping, and my personal review of it is on my website: Wade Cybertech - your spider, crawer, scraper resource center

detectedstealth 2010-05-14 12:40:05

Answer 16

A:

Here is another one: a simple PHP Scraper without Regex.

php html 2010-06-19 13:41:44

Answer 17

A:

I recommend using ScrapePro Web Scraper Designer. http://www.scrapepro.com

csharpp 2010-08-30 21:04:43

Answer 18

A:

ScraperWiki is a pretty interesting project. It helps you build scrapers online in Python, Ruby or PHP - I was able to get a simple attempt up in a few minutes.

ZhanZhuang 2010-09-24 04:50:43


How to implement a web scraper in PHP?
