I have a PHP script that takes a long time (5-30 minutes) to complete. Just in case it matters, the script is using cURL to scrape data from another server. This is the reason it's taking so long: it has to wait for each page to load before processing it and moving on to the next.

I want to be able to initiate the script and let it be until it's done, which will set a flag in a database table.

The problem is that I'm making the init call from an iPhone app, which doesn't like a request that takes so long. I want that initial request to be finished quickly.

So, in summary, what I need to know is how to end the HTTP request before the script has finished running. Also, is a PHP script the best way to do this? My server supports Ruby on Rails, Python, Perl, and Curl (not sure what that last one is referring to), although I don't have any experience in those other languages.

+3  A: 

The quick and dirty way would be to use the ignore_user_abort function in PHP. This basically says: no matter what the user does, run this script until it is finished. This is somewhat dangerous if it is a public-facing site, because it is possible to end up with 20+ copies of the script running at the same time if it is initiated 20 times.
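
A minimal sketch of this approach, which also lets the request return a response to the client right away (an editorial illustration, not part of the original answer; the exact output-buffering calls needed can vary with server configuration):

    <?php
    // Return the HTTP response immediately, then keep working.
    ignore_user_abort(true);   // keep running even if the client disconnects
    set_time_limit(0);         // remove the execution time limit for the long job

    ob_start();
    echo json_encode(array('status' => 'started'));
    header('Connection: close');
    header('Content-Length: ' . ob_get_length());
    ob_end_flush();            // send the buffered body
    flush();                   // push it to the client, which can now disconnect

    // ... long-running cURL scraping continues here ...
    // when done, set the "finished" flag in the database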

The "clean" way (at least IMHO) is to set a flag (in the db for example) when you want to initiate the process and run a cronjob every hour (or so) to check if that flag is set. If it IS set, the long running script starts, if it is NOT set, nothin happens.

FlorianH
So the "ignore_user_abort" method would allow the user to close the browser window, but is there something I could do to have it return an HTTP response to the client before it is finished running?
Kelso.b
+1  A: 

No, PHP is not the best solution.

I'm not sure about Ruby or Perl, but with Python you could rewrite your page scraper to be multi-threaded, and it would probably run at least 20x faster. Writing multi-threaded apps can be somewhat of a challenge, but the very first Python app I wrote was a multi-threaded page scraper. And you could simply call the Python script from within your PHP page by using one of the shell execution functions.

jamieb
The actual processing part of my scraping is very efficient. As I mentioned above, it's the loading of each page that kills me. What I was wondering is if PHP is meant to be run for such long periods.
Kelso.b
I'm a bit biased because since learning Python I outright loathe PHP. However, if you're scraping more than one page (in series), you're almost certain to get better performance by doing it in parallel with a multithreaded app.
jamieb
Any chance you could send me an example of such a page scraper? It would help me out aplenty seeing as I haven't yet touched Python.
Kelso.b
jo_dadday at hotmail dot com
Kelso.b
If I had to rewrite it, I'd just use eventlet. It'd make my code about 10x simpler: http://www.eventlet.net/doc/
jamieb
+3  A: 

You could use exec or system to start a background job, and then do the work in that.
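
A rough sketch of that idea (the script path and task-ID handling are assumptions; as the comment below notes, see symcbean's answer further down for how to detach the child process properly):

    <?php
    // kickoff.php - start the long job and return right away.
    // Redirecting output and appending '&' keeps exec() from blocking this request.
    $taskId = uniqid('scrape_', true);
    $cmd = sprintf(
        'php /path/to/longThing.php %s > /dev/null 2>&1 &',
        escapeshellarg($taskId)
    );
    exec($cmd);
    echo json_encode(array('task' => $taskId, 'status' => 'started'));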

Also, there are better approaches to scraping the web than the one you're using. You could use a threaded approach (multiple threads doing one page at a time), or one using an event loop (one thread doing multiple pages at a time). My personal approach in Perl would be to use AnyEvent::HTTP.

ETA: symcbean explained how to detach the background process properly here.

Leon Timmermans
Nearly right. Just using exec or system will come back to bite you on the ass. See my reply for details.
symcbean
A: 

I have done similar things with Perl: double fork() and detach from the parent process. All HTTP fetching work should be done in the forked process.

Alexandr Ciornii
A: 

I agree with the answers that say this should be run in a background process. But it's also important that you report on the status so the user knows that the work is being done.

When receiving the PHP request to kick off the process, you could store in a database a representation of the task with a unique identifier. Then, start the screen-scraping process, passing it the unique identifier. Report back to the iPhone app that the task has been started and that it should check a specified URL, containing the new task ID, to get the latest status. The iPhone application can now poll (or even "long poll") this URL. In the meantime, the background process would update the database representation of the task as it works, with a completion percentage, current step, or whatever other status indicators you'd like. And when it has finished, it would set a completed flag.
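
A sketch of what the polled status URL might look like (the table and column names are made up for illustration; the background process would run matching UPDATEs on the same row as it progresses):

    <?php
    // status.php?task=<id> - polled by the iPhone app.
    $db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $stmt = $db->prepare('SELECT step, percent_complete, completed FROM tasks WHERE id = ?');
    $stmt->execute(array($_GET['task']));
    $task = $stmt->fetch(PDO::FETCH_ASSOC);

    header('Content-Type: application/json');
    if ($task) {
        echo json_encode($task);
    } else {
        echo json_encode(array('error' => 'unknown task'));
    }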

Jacob
+8  A: 

Certainly it can be done with PHP; however, you should NOT do this as a background task: the new process has to be dissociated from the process group where it is initiated.

Since people keep giving the same wrong answer to this FAQ, I've written a fuller answer here:

http://symcbean.blogspot.com/2010/02/php-and-long-running-processes.html
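
A minimal sketch along those lines (the command, script path, and log location are assumptions, not taken from the linked post): nohup plus shell backgrounding keeps the worker from being tied to the web request, and echoing $! captures its PID for later checks.

    <?php
    // Launch the worker detached from the requesting process and capture its PID.
    $cmd = 'nohup php /path/to/longThing.php > /tmp/longThing.log 2>&1 & echo $!';
    $pid = (int) shell_exec($cmd);
    echo json_encode(array('pid' => $pid, 'status' => 'started'));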

C.

symcbean
+1 A rock solid and detailed answer on that blogpost.
wimvds
A: 

You can send it as an XHR (Ajax) request. Clients don't usually have any timeout for XHRs, unlike normal HTTP requests.

Alex JL
A: 

PHP may or may not be the best tool, but you know how to use it, and the rest of your application is written using it. These two qualities, combined with the fact that PHP is "good enough", make a pretty strong case for using it instead of Perl, Ruby, or Python.

If your goal is to learn another language, then pick one and use it. Any language you mentioned will do the job, no problem. I happen to like Perl, but what you like may be different.

Symcbean has some good advice about how to manage background processes at his link.

In short, write a CLI PHP script to handle the long bits. Make sure that it reports status in some way. Make a PHP page to handle status updates, either using AJAX or traditional methods. Your kickoff script will then start the process running in its own session and return confirmation that the process is going.
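
A sketch of what such a CLI worker might look like (the table and column names, the page list, and the use of file_get_contents as a stand-in for the cURL code are all assumptions for illustration):

    <?php
    // longThing.php <taskId> - CLI worker that reports progress to the tasks table.
    $taskId = $argv[1];
    $db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $update = $db->prepare('UPDATE tasks SET step = ?, percent_complete = ? WHERE id = ?');

    $pages = array('http://example.com/page1', 'http://example.com/page2'); // pages to scrape
    foreach ($pages as $i => $url) {
        $html = file_get_contents($url);   // stand-in for the real cURL fetch
        // ... process $html ...
        $update->execute(array($url, (int) round(($i + 1) / count($pages) * 100), $taskId));
    }
    $db->prepare('UPDATE tasks SET completed = 1 WHERE id = ?')->execute(array($taskId));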

Good luck.

daotoad
A: 

I have created a C++ service that can be used to run PHP scripts that need to run for long periods of time.

See http://jose.ydra.org/projects/PhpRunner

José
A: 

Use a proxy to delegate the request.

deepsat