I am developing an application that is, to put it simply, a niche-based search engine. Within the application I have included a function crawl() which crawls a website and then uses the _collectData() function to store the relevant data from each page in the "products" table, as described in the function. The visited pages are recorded in a database.
The crawler works pretty well, just as described, except for two problems: timeouts and memory. I've managed to fix the timeout error, but the memory problem remains. I know that simply raising memory_limit doesn't actually fix anything.
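For what it's worth, the growth is easy to watch with PHP's built-in memory functions. The helper below is just a debugging sketch (it's not part of my crawler); calling it at the top of _crawl() makes it easy to see how much each page adds:

function _logMemory($url) {
    // memory_get_usage(true) / memory_get_peak_usage(true) report allocated bytes
    $current = memory_get_usage(true) / 1048576;      // bytes -> MB
    $peak    = memory_get_peak_usage(true) / 1048576;
    error_log(sprintf('%.1f MB (peak %.1f MB) - %s', $current, $peak, $url));
}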
The function is run by visiting "EXAMPLE.COM/products/crawl".
Is a memory leak inevitable with a PHP web crawler, or is there something I'm doing wrong or failing to do?
Thanks in advance. (Code below.)
function crawl() {
    $this->_crawl('http://www.example.com/', 'http://www.example.com');
}

/**
 * Finds every link in $start, collects data from it,
 * and recursively crawls it.
 *
 * @param $start  the web page where the crawler starts
 * @param $domain the domain in which to stay
 */
function _crawl($start, $domain) {
    $dom = new DOMDocument();
    @$dom->loadHTMLFile($start);
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a"); // get all <a> elements

    for ($i = 0; $i < $hrefs->length; $i++) {
        $href = $hrefs->item($i);
        $url  = $href->getAttribute('href'); // get the href value

        if (!(strpos($url, 'http') !== false)) { // resolve relative links
            $url = $domain . '/' . $url;
        }

        // follow only links that stay on the domain and are not already in the database
        if ($this->Page->find('count', array('conditions' => array('Page.url' => $url))) < 1
            && (strpos($url, $domain) !== false)) {
            $this->Page->create();
            $this->Page->set('url', $url);
            $this->Page->set('indexed', date('Y-m-d H:i:s'));
            $this->Page->save(); // record this URL as visited

            $this->_collectData($url); // collect this page's data
            $this->_crawl($url, $domain); // recurse into this page
        }
    }
}
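One idea I've considered (a sketch only, untested, using the same CakePHP Page model and _collectData() helper) is replacing the recursion with an explicit queue, so that only one DOMDocument is alive at a time instead of one per recursion level:

function _crawlIterative($start, $domain) {
    $queue = array($start); // URLs waiting to be crawled

    while (!empty($queue)) {
        $page = array_shift($queue);

        $dom = new DOMDocument();
        @$dom->loadHTMLFile($page);
        $xpath = new DOMXPath($dom);
        $hrefs = $xpath->evaluate("/html/body//a");

        for ($i = 0; $i < $hrefs->length; $i++) {
            $url = $hrefs->item($i)->getAttribute('href');
            if (strpos($url, 'http') === false) { // resolve relative links
                $url = $domain . '/' . $url;
            }
            if (strpos($url, $domain) !== false
                && $this->Page->find('count', array('conditions' => array('Page.url' => $url))) < 1) {
                $this->Page->create();
                $this->Page->set('url', $url);
                $this->Page->set('indexed', date('Y-m-d H:i:s'));
                $this->Page->save();
                $this->_collectData($url);
                $queue[] = $url; // visit later instead of recursing now
            }
        }

        unset($dom, $xpath, $hrefs); // let this page's DOM be freed before loading the next
    }
}

My reasoning: the recursive version keeps every ancestor page's $dom in scope until its whole subtree is finished, so the number of live DOM trees grows with crawl depth, while the queue version should hold at most one. Would that actually explain the memory behaviour I'm seeing, or is the leak somewhere else?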