I'm building a spider which will traverse various sites and data-mine them.

Since I need to fetch each page separately, this could take a VERY long time (maybe 100 pages). I've already set set_time_limit to 2 minutes per page, but it seems Apache will kill the script after 5 minutes no matter what.

This isn't usually a problem, since this will run from cron or something similar, which does not have this time limit. However, I would also like the admins to be able to start a fetch manually via an HTTP interface.

It is not important that Apache is kept alive for the full duration; I'm going to use AJAX to trigger a fetch and check back once in a while with AJAX.

My problem is how to start the fetch from within a PHP-script without the fetch being terminated when the script calling it dies.

Maybe I could use system('script.php &') but I'm not sure it will do the trick. Any other ideas?
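
The "trigger, then check back with AJAX" idea above could be sketched with a small status file that the spider updates and a polling endpoint reads. This is only a minimal sketch: the function names `writeStatus` and `readStatus`, the JSON layout, and the idea of using a status file at all are assumptions for illustration, not something stated in the question.

```php
<?php
// Hypothetical helper the spider would call after each fetched page.
function writeStatus(string $file, int $done, int $total): void
{
    // Write to a temporary file first, then rename it into place, so a
    // polling request does not read a half-written file.
    $tmp = $file . '.tmp';
    file_put_contents($tmp, json_encode(['done' => $done, 'total' => $total]));
    rename($tmp, $file);
}

// Hypothetical helper the AJAX endpoint would call on every poll.
function readStatus(string $file): array
{
    $default = ['done' => 0, 'total' => 0, 'finished' => false];
    if (!is_file($file)) {
        return $default; // no fetch has reported progress yet
    }
    $data = json_decode((string) file_get_contents($file), true);
    if (!is_array($data)) {
        return $default; // unreadable or corrupt status file
    }
    $done  = (int) ($data['done'] ?? 0);
    $total = (int) ($data['total'] ?? 0);
    return [
        'done'     => $done,
        'total'    => $total,
        'finished' => $total > 0 && $done >= $total,
    ];
}
```

The polling script would then just `echo json_encode(readStatus($file));` and return immediately, so no request comes near the web server's time limit.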

+4  A: 
    # $params must already be shell-safe: pass each value through escapeshellarg()
    $cmd = "php myscript.php $params > /dev/null 2>/dev/null &";

    # the trailing & backgrounds the command, so the rest of the script
    # keeps executing without waiting for a response
    shell_exec($cmd);

This sends both STDOUT and STDERR to /dev/null and runs the command in the background, so your script keeps executing. Even if the 'parent' script finishes before myscript.php, myscript.php will run to completion.
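
For illustration, the same command can be assembled by a small helper that escapes every argument individually. This is a sketch, not part of the original answer: `buildBackgroundCommand` is a hypothetical name, and `2>&1` is used as an equivalent way of sending STDERR to the same place as STDOUT. It assumes a Unix-like shell.

```php
<?php
// Hypothetical helper: builds a shell command that runs a PHP script in the
// background, with the script name and each argument shell-escaped.
function buildBackgroundCommand(string $script, array $args = []): string
{
    $parts = ['php', escapeshellarg($script)];
    foreach ($args as $arg) {
        $parts[] = escapeshellarg((string) $arg);
    }
    // Discard STDOUT and STDERR, then detach with "&" so the caller
    // does not wait for the command to finish.
    return implode(' ', $parts) . ' > /dev/null 2>&1 &';
}
```

It would be used as `shell_exec(buildBackgroundCommand('myscript.php', [$url]));`, where `$url` stands for whatever value the spider needs.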

Erik
don't forget to use http://php.net/manual/en/function.escapeshellarg.php on $params
Andy
Thanks, that did the trick :)
Nicklas Ansman
NP, and welcome to SO
Erik
But it will fall on its arse if the session terminates, and it can no longer be signalled with a HUP (e.g. to stop gracefully). A better solution is to attach it to a different session leader, e.g. 'echo php myscript.php | at now'
symcbean
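
A rough sketch of how the `at now` variant from the comment above might be built from PHP. `buildAtCommand` is a hypothetical name, and the sketch assumes the `at` daemon is installed and that the web server's user is permitted to submit jobs.

```php
<?php
// Hypothetical helper: wraps a command so that it is handed to at(1) and
// run by the at daemon instead of as a child of the web server process.
function buildAtCommand(string $command): string
{
    // The whole command is passed to "echo" as one escaped argument and
    // piped into at; 2>&1 captures at's own job message.
    return 'echo ' . escapeshellarg($command) . ' | at now 2>&1';
}
```

Calling `shell_exec(buildAtCommand('php myscript.php'));` would then queue the job and return at once.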
...and you would not believe how complicated and flaky Erik's suggestion will get if you launch it from Apache.
symcbean
@symcbean: What do you mean by "if the session terminates"? The calling script can end and the called script continues to run to completion.
Erik
@symcbean: I'd like to learn, so why don't you back up your comments? I've had this identical piece of code running on a production site for a year. It gets called 100+ times per hour and it's never exhibited any problems.
Erik
"If it terminates": the calling script should end if it's started from a web page, but the **PROCESS** may continue to serve other requests before being recycled (how many depends on the config). http://symcbean.blogspot.com/2010/02/php-and-long-running-processes.html
symcbean
I can just report that this worked fine and the spider crawls just fine even after the calling script has finished. Big thanks to you Erik :)
Nicklas Ansman