I am writing a PHP cron job that reads thousands of feeds and web pages using curl and stores the content in a database. How do I restrict the number of threads to, let's say, 6? That is, even though I need to scan thousands of feeds and web pages, I want only 6 curl threads active at any time so that my server and network don't get bogged down. I could do this easily in Java using the wait, notify, and notifyAll methods of Object. Should I build my own semaphore, or does PHP provide any built-in functions?

+1  A: 

First of all, PHP doesn't have threads, but it does have process control: http://php.net/manual/en/book.pcntl.php

I've built a class around these functions to help with my multi-process requirements.

I'm in a similar situation. I'm keeping a log of the processes that get started from cron and their state. I'm checking on them from a related cron job.

EDIT (more details):

In my project I log all the key changes to the database. Actions may then be taken if the changes meet an action's criteria. So what I'm doing is different from what you're doing. However, there are some similarities.

When I fork a new process, I enter its pid in a DB table. The next time the cron job kicks in, part of what it does is check whether those processes have completed properly, and then mark the action as completed in that DB table.
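For the "check whether the process is still there" part, here is a minimal sketch, assuming the posix extension is available. The function name isRunning is made up for illustration, and it is not the class mentioned above. Sending signal 0 delivers nothing; it only asks the kernel whether the pid exists.

```php
<?php
// Hypothetical helper: report whether a previously recorded pid still exists.
// Assumes the posix extension is loaded.
function isRunning(int $pid): bool
{
    // Guard against 0 and negative values, which address process groups
    // rather than a single process.
    if ($pid <= 0) {
        return false;
    }
    // Signal 0 performs only the existence/permission check.
    // Caveat: this also returns false for a live process owned by another
    // user, and true for an exited child that has not been reaped yet.
    return posix_kill($pid, 0);
}
```

Note that pids get reused by the OS, so a check like this is only a hint; storing a start time next to the pid in the table would make it more reliable.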

You don't give many details about your project. So I will just throw out a suggestion:

  • A DB table holds the URLs of the resources you want to download.
  • Another table holds the pids of the running processes.
  • A cron job that runs every hour goes through the URL table, downloads each resource, and stores it in the DB. Before starting any downloads, it checks the pid table for completed, dead, and running processes and acts accordingly. This is where you can limit the number of processes to 6.
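The "limit your processes to 6" step could be sketched roughly like this, assuming the pcntl extension and the CLI. It leaves out the DB tables entirely, and runLimited and its parameters are names invented for this example. Since only the parent touches the counter, nothing has to be synchronized between the children.

```php
<?php
// Hypothetical helper: run $task once per item, with at most $maxWorkers
// child processes alive at any moment. Requires the pcntl extension (CLI).
function runLimited(array $items, int $maxWorkers, callable $task): int
{
    $running = 0;   // children currently alive
    $finished = 0;  // children reaped so far

    foreach ($items as $item) {
        // At the limit: block until one child exits before forking another.
        if ($running >= $maxWorkers) {
            pcntl_wait($status);
            $running--;
            $finished++;
        }

        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException('fork failed');
        }
        if ($pid === 0) {
            // Child: do the work, then exit so it never continues the loop.
            $task($item);
            exit(0);
        }
        // Parent: one more child is alive.
        $running++;
    }

    // Reap whatever is still running.
    while ($running > 0) {
        pcntl_wait($status);
        $running--;
        $finished++;
    }
    return $finished;
}
```

You would call it as something like runLimited($urls, 6, $download), where $download is your own function that fetches one URL and stores it. Each child should open its own DB connection rather than reuse one inherited from the parent.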

Depending on the size of your project, this may seem like overkill. However, I've thought about it for a long time, and I want to keep track of all those forked processes. Forking can be risky business and can lead to system resource overload - speaking from experience ;)

I'd be interested to hear other techniques as well.

sims
Thanks for the quick answer. How about synchronizing the counter among the 6 processes? Two processes shouldn't attempt to update the counter at the same time.
Vasu
I've added some detail to the answer. I don't know how your project is structured, so I can't be sure this fits for you, but I hope it helps.
sims
+1  A: 

I ended up using http://github.com/LionsAd/rolling-curl library for my needs. No processes, no threads.
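For anyone curious how this works without processes or threads, here is a rough sketch of the rolling-window idea using PHP's built-in curl_multi_* functions. It does not use the library's own API; fetchRolling and its parameters are invented for this example, and the 30 second timeout is an arbitrary choice.

```php
<?php
// Hypothetical helper: fetch $urls with at most $window transfers in flight.
// Requires the curl extension. Returns body (or null on failure) keyed by URL.
function fetchRolling(array $urls, int $window = 6): array
{
    $mh = curl_multi_init();
    $queue = array_values($urls);
    $results = [];
    $active = 0; // handles currently attached to the multi handle

    // Take the next URL off the queue and start its transfer.
    $add = function () use (&$queue, &$active, $mh) {
        $url = array_shift($queue);
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_setopt($ch, CURLOPT_PRIVATE, $url); // remember which URL this is
        curl_multi_add_handle($mh, $ch);
        $active++;
    };

    // Fill the initial window.
    while ($queue && $active < $window) {
        $add();
    }

    while ($active > 0) {
        curl_multi_exec($mh, $stillRunning);
        // Wait for activity instead of spinning.
        if (curl_multi_select($mh, 1.0) === -1) {
            usleep(10000);
        }
        // Collect finished transfers and start one new transfer for each.
        while ($info = curl_multi_info_read($mh)) {
            $ch = $info['handle'];
            $url = curl_getinfo($ch, CURLINFO_PRIVATE);
            $results[$url] = $info['result'] === CURLE_OK
                ? curl_multi_getcontent($ch)
                : null;
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
            $active--;
            if ($queue) {
                $add();
            }
        }
    }

    curl_multi_close($mh);
    return $results;
}
```

Because a new transfer starts only when one finishes, there are never more than $window requests open, which is the 6-at-a-time limit from the question. For thousands of URLs you would probably store each result as it completes instead of collecting everything in an array.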

Vasu
Nice. That looks like an elegant solution.
sims
A: 

This looks like another job for gearman (gearman.org).

andreas
A: 
vlad b.