views: 664

answers: 4

Hey guys/girls,

Basically I need to get around max execution time.

I need to scrape pages for info at varying intervals, which means calling the bot at those intervals to load a link from the database and scrape the page the link points to.

The problem is loading the bot. If I load it with JavaScript (like an Ajax call), the browser will throw up an error saying that the page is taking too long to respond, yadda yadda yadda, plus I will have to keep the page open.

If I do it from within PHP I could probably extend the execution time to however long is needed, but then if it does throw an error I don't have access to kill the process, and nothing is displayed in the browser until the PHP execution is completed, right?

I was wondering if anyone had any tricks to get around this? I need the scraper to execute by itself at various intervals without me needing to watch it the whole time.

Cheers :)

+1  A: 

Use set_time_limit() as such:

set_time_limit(0); // 0 means no time limit for this script run
// Do time-consuming operations here
Andrew Moore
Thanks dude I'll use this in combination with flush() Cheers :)
hamstar
+1  A: 

Take a look at how Sphider (a PHP search engine) does this.

Basically you just process some part of the sites you need, do your thing, and go on to the next request if there's a continue=true parameter set, as sketched below.
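
A minimal sketch of that idea (get_links_from_db() and scrape_page() are hypothetical helpers, and the parameter names are just illustrative, not Sphider's actual code):

<?php
// One chunk per request: grab a small batch, scrape it, then hand off
// to a fresh request so no single run hits max_execution_time.
$offset    = isset($_GET['offset']) ? (int) $_GET['offset'] : 0;
$batchSize = 10;

$links = get_links_from_db($offset, $batchSize);   // hypothetical helper
foreach ($links as $link) {
    scrape_page($link);                            // hypothetical helper
}

// If there is more work and continue=true was passed, kick off the next chunk.
if (isset($_GET['continue']) && count($links) === $batchSize) {
    header('Location: scraper.php?continue=true&offset=' . ($offset + $batchSize));
}
?>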

SchizoDuckie
A: 

"nothing is displayed in the browser until the PHP execute is completed"

You can use flush() to work around this:

flush()

(PHP 4, PHP 5)

Flushes the output buffers of PHP and whatever backend PHP is using (CGI, a web server, etc). This effectively tries to push all the output so far to the user's browser.
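
For example, a rough sketch combining it with the set_time_limit() answer (scrape_page() is a hypothetical function, and whether output really appears immediately also depends on the web server and any output buffering in front of PHP):

<?php
set_time_limit(0);            // no execution limit, per the other answer
while (@ob_end_flush());      // close any existing output buffers
ob_implicit_flush(true);      // flush automatically after each echo

foreach ($links as $link) {   // assumes $links was loaded from the database earlier
    scrape_page($link);       // hypothetical scraping function
    echo 'Done: ' . htmlspecialchars($link) . "<br />\n";
    flush();                  // push progress to the browser right away
}
?>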

Colin Pickard
Thanks dude I'll use this in combination with set_time_limit() Cheers :)
hamstar
You're welcome :)
Colin Pickard
A: 

Run it via cron and split the spider into chunks, so it only does a few chunks at once. Call it from cron with different parameters to process only a few chunks per run.
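
For instance (the path, schedule, and helper functions here are made up for illustration):

# crontab entry: run one chunk every 15 minutes
*/15 * * * * /usr/bin/php /path/to/scraper.php >> /var/log/scraper.log 2>&1

<?php
// scraper.php, run from the command line by cron.
// PHP CLI has no max_execution_time by default, and no browser needs to stay open.
$chunkSize = 20;
$links = get_unprocessed_links($chunkSize);   // hypothetical: oldest unscraped links
foreach ($links as $link) {
    scrape_page($link);                       // hypothetical scraping function
    mark_link_done($link);                    // hypothetical: so the next run skips it
}
?>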

dusoft