Preventing my PHP Web Crawler from Stalling | ansaurus

Q:

Preventing my PHP Web Crawler from Stalling

I'm using the PHPCrawl class and have added some DOMDocument and DOMXPath code to extract specific data from web pages. However, the script stalls long before it has crawled the whole website.

I have set_time_limit set to 100000000, so that shouldn't be the issue.

Any ideas?

Thank you, Nick

<?php

// It may take a while to crawl a site ...
set_time_limit(100000000);

// Include the phpcrawl-mainclass
include("classes/phpcrawler.class.php");

//connect to the database
mysql_connect('localhost','#####','#####');
mysql_select_db('ft2');

// Extend the class and override the handlePageData()-method
class MyCrawler extends PHPCrawler 
{
  function handlePageData(&$page_data) 
  {
    // Here comes your code.
    // Do whatever you want with the information given in the
    // array $page_data about a page or file that the crawler actually found.
    // See a complete list of elements the array will contain in the 
    // class-reference.
    // This is just a simple example.

    // Print the URL of the actual requested page or file
    echo "Page requested: ".$page_data["url"]."<br>";

    // Print the first line of the header the server sent (HTTP-status)
    //echo "Status: ".strtok($page_data["header"], "\n")."<br>";

    // Print the referer
    //echo "Referer-page: ".$page_data["referer_url"]."<br>";

    // Print whether the content was received or not
    /*if ($page_data["received"]==true)
      echo "Content received: ".$page_data["bytes_received"]." bytes";
    else
      echo "Content not received";
    */
    // ...

    // Now you should do something with the content of the actual
    // received page or file ($page_data[source]), we skip it in this example

    //echo "<br><br>";
    echo str_pad(" ", 5000); // "Force flush", workaround
    flush();

    // This is where we tear the data apart looking for usernames and timestamps
    $url = $page_data["url"];
    $html = new DOMDocument();
    $html->loadHTMLFile($url);

    $xpath = new DOMXpath($html);

    // Children of ol id=posts
    $links = $xpath->query("//li[@class='postbit postbitim postcontainer']");

    foreach ($links as $results) {
      $newDom = new DOMDocument;
      $newDom->appendChild($newDom->importNode($results, true));

      $xpath = new DOMXpath($newDom);
      $time_stamp = substr($xpath->query("div/div/span/span")->item(0)->nodeValue, 0, 10);
      $user_name = trim($xpath->query("div/div[2]/div/div/div/a/strong/font")->item(0)->nodeValue);

      $return[] = array(
        'time_stamp' => $time_stamp,
        'username' => $user_name,
      );
    }

    foreach ($return as $output) {
      echo "<strong>Time posted: " . $output['time_stamp'] . " by " . $output['username'] . "</strong>";
      // Make your database entry
      $time_stamp = $output['time_stamp'];
      list($month, $day, $year) = split('[/.-]', $time_stamp);
      $time_stamp = $year."-".$month."-".$day;
      echo $time_stamp;

      $username = $output['username'];
      $sql = "INSERT INTO lovesystems VALUES ('$username','$url','$time_stamp')";
      if (mysql_query($sql)) echo "Successfully input user in database!<br/>";
      else echo mysql_error();
    }
  }
}

// Now, create an instance of the class, set the behaviour
// of the crawler (see class-reference for more methods)
// and start the crawling-process.

$crawler = &new MyCrawler();

// URL to crawl
$crawler->setURL("http://######.com");

// Only receive content of files with content-type "text/html"
// (regular expression, preg)
$crawler->addReceiveContentType("/text\/html/");

// Ignore links to pictures, don't even request pictures
// (preg_match)
$crawler->addNonFollowMatch("/.(jpg|gif|png)$/ i");

// Store and send cookie-data like a browser does
$crawler->setCookieHandling(true);

// Set the traffic-limit to 1 MB (in bytes,
// for testing we don't want to "suck" the whole site)
//$crawler->setTrafficLimit(1000 * 1024);

// That's enough, now here we go
$crawler->go();




// At the end, after the process is finished, we print a short
// report (see method getReport() for more information)

$report = $crawler->getReport();

echo "Summary:<br>";
if ($report["traffic_limit_reached"]==true)
  echo "Traffic-limit reached <br>";

echo "Links followed: ".$report["links_followed"]."<br>";
echo "Files received: ".$report["files_received"]."<br>";
echo "Bytes received: ".$report["bytes_received"]."<br>";

?>

A:

Check your server's configuration. I'm pretty sure Apache has a script timeout in its configuration.
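
One way to start checking this from the PHP side is to print the limits that are actually in effect when the crawler runs, and to turn on error display so a fatal error is not swallowed silently. This is only a rough sketch: `crawl_limits_report()` is a hypothetical helper name, not part of PHPCrawl, and it covers PHP's own settings only. A web-server timeout (for example Apache's `Timeout` directive) has to be checked in the server configuration itself.

```php
<?php
// Show every error, so a fatal error (e.g. memory exhausted) is visible
// instead of the script just appearing to stall.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Hypothetical helper (not part of PHPCrawl): collect the PHP-side limits
// that commonly end a long-running script early.
function crawl_limits_report()
{
    return array(
        'sapi'               => php_sapi_name(),                // e.g. "cli" or "apache2handler"
        'max_execution_time' => ini_get('max_execution_time'),  // "0" means no PHP time limit
        'memory_limit'       => ini_get('memory_limit'),
        'memory_peak_bytes'  => memory_get_peak_usage(true),
    );
}

// Print the report once before starting the crawl.
foreach (crawl_limits_report() as $name => $value) {
    echo $name . ': ' . $value . "\n";
}
```

If the script is started from the command line rather than through Apache, the web server's timeout does not apply, and PHP's `max_execution_time` defaults to 0 there, so comparing the two environments may help narrow down where the limit comes from.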

Samuel 2010-10-30 18:08:41
