Hi all,

I am building a website in CakePHP that processes files uploaded through an XML-RPC API and through a web frontend. Files need to be scanned by ClamAV, thumbnails need to be generated, etcetera. All of this is resource-intensive work that takes some time and that the user should not have to wait for. So, I am looking into asynchronous processing with PHP in general and CakePHP in particular.

I came across the MultiTask plugin for CakePHP that looks promising. I also came across various message queue implementations such as dropr and beanstalkd. Of course, I will also need some kind of background process, probably implemented as a Cake Shell. I saw that MultiTask uses PHP_Fork to implement a multithreaded PHP daemon.

I need some advice on how to fit all these pieces together in the best way.

  • Is it a good idea to have a long-running daemon written in PHP? What should I watch out for?
  • What are the advantages of external message queue implementations? The MultiTask plugin does not use an external message queue; it rolls its own using a MySQL table to store tasks.
  • What message queue should I use? dropr? beanstalkd? Something else?
  • How should I implement the backend processor? Is a forking PHP daemon a good idea or just asking for trouble?

My current plan is either to use the MultiTask plugin as-is or to modify it to use beanstalkd instead of its own MySQL table implementation. Jobs in the queue can simply consist of a task name and an array of parameters. The PHP daemon would watch for incoming jobs and pass them out to one of its child threads, which would simply execute the CakePHP Task with the given parameters.
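
For example, a job payload along those lines could be as simple as the following sketch (assuming JSON encoding and the Pheanstalk client for beanstalkd; 'ScanUpload' and the parameters are made-up examples, and the client API varies between versions):

    <?php
    // Sketch only: enqueue a "task name + parameters" job as JSON.
    use Pheanstalk\Pheanstalk;

    require 'vendor/autoload.php';

    $queue = Pheanstalk::create('127.0.0.1');

    $queue->put(json_encode(array(
        'task'   => 'ScanUpload',                            // CakePHP Task to run
        'params' => array('file' => '/uploads/example.pdf'), // arguments for that task
    )));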

Any opinions, advice, comments, gotchas or flames on this?

+1  A: 

If you use a message queue like beanstalkd, you can start as many processes as you'd like (even on the same server). Each worker process will take one job from the queue and process it. You can add more workers and more servers if you need more capacity.

The nice thing about using a single-threaded worker is that you don't have to deal with synchronization inside a process. The job queue will make sure no job is handled twice.
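
For illustration, a minimal single-threaded worker could look roughly like this (a sketch assuming the Pheanstalk client and the JSON payload format from the question; runTask() is a hypothetical dispatcher to your CakePHP Task, and the client API varies between versions):

    <?php
    // Minimal single-threaded worker sketch: reserve, run, delete.
    use Pheanstalk\Pheanstalk;

    require 'vendor/autoload.php';

    $queue = Pheanstalk::create('127.0.0.1');

    while (true) {
        $job  = $queue->reserve();             // blocks until a job is available
        $data = json_decode($job->getData(), true);

        try {
            runTask($data['task'], $data['params']);  // hypothetical dispatcher
            $queue->delete($job);              // success: remove it from the queue
        } catch (Exception $e) {
            $queue->bury($job);                // failure: set it aside for inspection
        }
    }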

Peter Stuifzand
A: 

Might also be worth checking out Amazon SQS to be used in conjunction with EC2?

neilcrookes
No thanks. I wish to be self-reliant. No dependencies on outside services except for an ISP with a rack and a big, fat pipe.
Sander Marechal
I understand that SQS can also have some significant latencies. Not a problem if you are transcoding videos or sound; more so if you're fetching info as people log in.
Alister Bulman
+5  A: 

I've had excellent results with BeanstalkD and a back-end written in PHP to retrieve jobs and then act on them. I wrapped the actual job-running in a bash script that keeps it running even when it exits (unless I do an 'exit(UNIQNUM);', which the script checks for and then actually stops). That way, the restarted PHP script clears down any memory that may have been used and can start afresh every 25/50/100 jobs it runs.
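
The PHP side of that pattern might look roughly like the following sketch (the wrapper is just a shell loop that restarts the script while it exits cleanly; the batch size and the stop condition here are arbitrary simplifications):

    <?php
    // Sketch of the "exit every N jobs" pattern. An outer shell loop such as
    //   while true; do php worker.php || break; done
    // restarts the worker after each clean exit, so memory is released per batch;
    // exiting with a non-zero status is the "really stop now" signal it checks.
    use Pheanstalk\Pheanstalk;

    require 'vendor/autoload.php';

    $jobsPerRun = 50;                          // arbitrary batch size
    $queue      = Pheanstalk::create('127.0.0.1');

    for ($i = 0; $i < $jobsPerRun; $i++) {
        $job = $queue->reserve();
        // ... process the job here ...
        $queue->delete($job);
    }

    exit(0);   // normal exit: the wrapper starts a fresh process with clean memory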

A couple of the advantages of using it are that you can set priorities and delays on a BeanstalkD job - "run this at a lower priority, but don't start for 10 seconds". I've also queued a number of jobs up at the same time (run this now, again in 5 seconds and again after 30 secs).
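
With the Pheanstalk client that looks roughly like this sketch - put() takes priority, delay and time-to-run arguments (the payload and the numbers here are made up):

    <?php
    // Sketch: priorities and delays via put(data, priority, delay, ttr).
    // Lower priority numbers are served sooner.
    use Pheanstalk\Pheanstalk;

    require 'vendor/autoload.php';

    $queue   = Pheanstalk::create('127.0.0.1');
    $payload = json_encode(array('task' => 'ResizeImage', 'params' => array('id' => 42)));

    // "Run this at a lower priority, but don't start for 10 seconds."
    $queue->put($payload, 2048, 10, 120);

    // The same job queued to run now, in 5 seconds and again after 30 seconds.
    $queue->put($payload, 1024, 0, 120);
    $queue->put($payload, 1024, 5, 120);
    $queue->put($payload, 1024, 30, 120);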

With the appropriate network configuration (running it on an IP address accessible to the rest of your network), you can also run a beanstalkd daemon on one server and have it polled from a number of other machines, so if a large number of tasks are being generated, the work can be split between servers. If a particular set of tasks needs to run on a particular machine, I create a 'tube' named after that machine's hostname, which should be unique within our cluster, if not globally (useful for file uploads). I found it worked perfectly for image resizing, often writing the finished smaller images back to the file system before the webpage that refers to them had even requested the URL they would arrive at.
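
A rough sketch of that tube-per-host routing, again assuming the Pheanstalk client (tube-name handling differs slightly between client versions, and the payload is a made-up example):

    <?php
    // Sketch: route machine-specific jobs through a tube named after the host.
    use Pheanstalk\Pheanstalk;

    require 'vendor/autoload.php';

    $tube    = php_uname('n');                 // this machine's hostname
    $queue   = Pheanstalk::create('127.0.0.1');
    $payload = json_encode(array('task' => 'ResizeImage', 'params' => array('file' => 'photo.jpg')));

    // Producer side: send the job to the tube of the machine that holds the file.
    $queue->useTube($tube);
    $queue->put($payload);

    // Worker side, on that machine: listen only to its own tube.
    $queue->watch($tube);
    $queue->ignore('default');
    $job = $queue->reserve();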

I'm actually about to start writing a series of articles on this very subject for my blog (including some techniques for code that I've already pushed several million live requests through) - my URL is linked from my user profile here on Stack Overflow.

(I've started on the series of articles)

Alister Bulman
Thanks, that was helpful. I am working with Beanstalkd at the moment as well. So far I have created a simple CakePHP Model behaviour called "deferred", which is just a delayed method call on a Model. The deferred behaviour puts the call in beanstalkd, and a Cake Shell running in the background gets the messages from beanstalkd and executes the calls. My only worry so far is that Beanstalkd is not persistent. Have you had any problems with that? What if beanstalkd dies and some of your images are never resized?
Sander Marechal
it's never died on me yet, even when I put 100,000 strings into it. Also, if the image doesn't get resized, it's still in the upload directory, and can be done later.
Alister Bulman
Just for reference, I think the Beanstalkd site you linked to has changed, as it doesn't seem relevant to any sort of PHP programming project.
Rick
@rick - oops - it wasn't actually a valid URL. Fixed.
Alister Bulman