I have a list of data that needs to be processed. The way it works right now is this:

  • A user clicks a process button.
  • The PHP code takes the first item that needs to be processed, takes 15-25 secs to process it, moves on to the next item, and so on.

This takes way too long. What I'd like instead is this:

  • The user clicks the process button.
  • A PHP script takes the first item and starts to process it.
  • Simultaneously another instance of the script takes the next item and processes it.
  • And so on, so around 5-6 items are being processed simultaneously and we get 6 items processed in 15-25 secs instead of just one.

Is something like this possible?

I was thinking of using cron to launch an instance of the script every second. All items that need to be processed will be flagged as such in the MySQL database, so whenever an instance is launched through cron, it will simply take the next item flagged for processing and remove the flag.

Thoughts?

Edit: To clarify, each 'item' is stored as a separate row in a MySQL table. Whenever processing starts on an item, it is flagged as being processed in the DB, so each new instance will simply grab the next row that is not being processed and process it. Hence I don't have to supply the items as command-line arguments.
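
Roughly what I have in mind for each cron-launched instance - a minimal sketch (the table/column names and `process_item()` are placeholders):

    <?php
    // Atomically claim the next unprocessed row so that two concurrent
    // instances never grab the same item.
    $db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

    $token = uniqid('worker_', true);

    // Flag the next pending row with a token unique to this instance.
    $claim = $db->prepare("UPDATE items SET claimed_by = ?
                           WHERE claimed_by IS NULL
                           ORDER BY id LIMIT 1");
    $claim->execute(array($token));

    // Fetch the row we just claimed, if one was left.
    $stmt = $db->prepare("SELECT * FROM items WHERE claimed_by = ?");
    $stmt->execute(array($token));
    $item = $stmt->fetch(PDO::FETCH_ASSOC);

    if ($item) {
        process_item($item); // the 15-25 second job (placeholder)
        $db->prepare("UPDATE items SET done = 1 WHERE id = ?")
           ->execute(array($item['id']));
    }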

A: 

You can use pcntl_fork() and family to fork a process - however, you may need something like IPC to communicate back to the parent process that the child process (the one you forked) has finished.

You could have them write to shared memory, e.g. via memcache or a DB.

You could also have each child process write its completed data to a file that the parent process keeps checking - as each child completes, its file is created/written/updated, and the parent process can grab the results one at a time and throw them back to the caller/client.

The parent's job is to control the queue - to make sure the same data isn't processed twice - and also to sanity-check the children (better to kill that runaway process and start over... etc.).

Something else to keep in mind: on Windows platforms you are going to be severely limited - I don't think you even have access to the pcntl_* functions unless you compiled PHP with support for them.

Also, can you cache the data once it's been processed, or is it unique every time? That would surely speed things up.
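
For completeness, a bare-bones sketch of the fork-and-wait pattern - no IPC, the parent just reaps children with pcntl_wait() (`get_next_item()` and `process_item()` are placeholders):

    <?php
    // Fork 5 workers, each processing one item, then wait for them all.
    $children = array();

    for ($i = 0; $i < 5; $i++) {
        $pid = pcntl_fork();
        if ($pid == -1) {
            die("fork failed\n");
        } elseif ($pid == 0) {
            // Child: do one unit of work, then exit.
            process_item(get_next_item());
            exit(0);
        } else {
            // Parent: remember the child's pid.
            $children[] = $pid;
        }
    }

    // Parent: block until every child has exited.
    while (count($children) > 0) {
        $pid = pcntl_wait($status); // returns the pid of an exited child
        $children = array_diff($children, array($pid));
    }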

Mr-sk
Well, you don't really need IPC to tell when a child process is done. You can spawn several child processes, giving each one a file to put its result in, then use `pcntl_wait` (http://www.php.net/manual/en/function.pcntl-wait.php) to wait for the child processes to complete. When each one completes, pull in its data and `pcntl_wait` again for the next one, keeping track until all children have returned.
pib
Yeah, that's true - I gave a few examples of not using IPC (in fact, I also gave an example of using a file), but really, we're all still talking about shared memory.
Mr-sk
pcntl_fork can only be used on a Unix system, am I right?
Martin
I want to say yes, but you should check php.net - I think for Windows we ended up using COM: http://us.php.net/manual/en/book.com.php. Ugh, COM!
Mr-sk
I don't need to communicate back and forth. I just need a way of launching another instance of a PHP script. It will just take the next item flagged as needing to be processed from the DB and process it. Any thoughts on how that can be done?
Click Upvote
+1  A: 

There is no multithreading in PHP; however, you can fork processes:

php.net: pcntl_fork (http://php.net/manual/en/function.pcntl-fork.php)

Or you could execute a system() command and start another process, achieving the same effect with multiple processes instead of threads.

sberry2A
Can you give some sample code or more detail on the system() function?
Click Upvote
+5  A: 
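
Use `exec()` to launch detached instances of a worker script with `nohup`, so they keep running after the web request returns. A minimal sketch (it assumes the PHP CLI binary is on the PATH and a `proc.php` worker that grabs the next flagged row from the DB itself):

    <?php
    // Kick off 6 background workers. Output must be redirected and the
    // command backgrounded with & so exec() returns immediately.
    for ($i = 0; $i < 6; $i++) {
        exec('nohup php proc.php > /dev/null 2>&1 &');
    }
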
Mike
A nice solution; my only concern is the nohup - how can I check if my server has it installed? I don't need to pass the arguments via the command line; they're stored in the DB, so the `proc.php` file will just fetch the next pending item from the DB and process it - no arguments need to be passed. Assuming that's what the PHP CLI is needed for, I won't need it.
Click Upvote
`nohup` is part of the coreutils package, which includes such things as `ls`, `cd`, etc. I'd be surprised if any Linux server doesn't have it. All `nohup` does is: "run a command immune to hangups, with output to a non-tty". Since exec effectively runs things as though you were using a shell, you need the PHP binary to be available to run your PHP scripts. If it isn't, there's a way around it: if `wget` is installed you can do the same trick but pull webpages from your server (though you really don't want to do it that way if you can avoid it).
Mike
+5  A: 

Use an external work queue like Beanstalkd, which your PHP script writes a bunch of jobs to. You then have as many worker processes as you like pulling jobs from beanstalkd and processing them as fast as possible. You can spin up as many workers as you have memory / CPU for. Your job body should contain as little information as possible - maybe just some IDs which you hit the DB with. beanstalkd has a slew of client libraries, and its own protocol is very basic - think memcached.

We use beanstalkd to process all of our background jobs, and I love it. It's easy to use and very fast.
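
A rough sketch using the Pheanstalk client library (its classic API is assumed; `$pendingIds` and `process_item()` are placeholders, and the job body is just a row ID):

    <?php
    // Producer - runs when the user clicks "process": one job per row ID.
    $queue = new Pheanstalk('127.0.0.1');
    foreach ($pendingIds as $id) {
        $queue->useTube('items')->put($id); // job body: just the ID
    }

    // Worker - run 5-6 of these processes in parallel.
    $queue = new Pheanstalk('127.0.0.1');
    while (true) {
        $job = $queue->watch('items')->ignore('default')->reserve();
        process_item($job->getData()); // the 15-25 second job
        $queue->delete($job);          // done - remove the job
    }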

Cody Caughlan
Gearman is also a good solution for external work outsourcing.
Charles
You can also do this with redis nowadays because it has a blocking pop function. See http://simonwillison.net/2010/Jan/7/blocking/
Alfred
Can this be used automatically/programmatically? I'm not looking for anything done through the command line, and I want to do as little sysadmin work as possible. The installation also looks hard.
Click Upvote
+1  A: 

Could you implement threading in JavaScript on the client side? It seems to me I've seen a JavaScript library (from Google, perhaps?) that implements it - Google it and I'm sure you'll find something. I've never done it, but I know it's possible. Anyway, your client-side JavaScript could activate (via Ajax) a PHP script once for each item, in separate threads. That might be easier than trying to do it all on the server side.

-don

Don Dickinson
A: 

If you are running a high-traffic PHP server you are INSANE if you do not use the Alternative PHP Cache: http://php.net/manual/en/book.apc.php. You do not have to make code modifications to run APC.

Another useful technique that can work alongside APC is using the Smarty template system, which allows you to cache output so that pages do not have to be rebuilt.

Rook
Yup, an opcode cache for a busy website is a MUST!
Alfred
Nothing at all to do with the question asked.
Click Upvote
A: 

To solve this problem, I've used two different products: Gearman and RabbitMQ.

The benefit of putting your jobs into some sort of queuing software like Gearman or Rabbit is that if you have multiple machines, they can all participate in processing items off the queue(s).

Gearman is easier to set up, so I'd suggest poking around with it a bit first. If you find you need something more heavy-duty in terms of queue robustness, look into RabbitMQ.
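
To give a feel for it, a minimal sketch using PHP's PECL gearman extension (`process_item()`, `$itemId` and the job name are placeholders):

    <?php
    // Client - queue one fire-and-forget background job per item ID.
    $client = new GearmanClient();
    $client->addServer(); // defaults to 127.0.0.1:4730
    $client->doBackground('process_item', $itemId);

    // Worker - run several of these processes in parallel.
    $worker = new GearmanWorker();
    $worker->addServer();
    $worker->addFunction('process_item', function (GearmanJob $job) {
        process_item($job->workload()); // the 15-25 second job
    });
    while ($worker->work()); // handle jobs one at a time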

sfrench