views:

129

answers:

2

I have a MySQL database table filled with 1000+ records, let's say 5000. Each record has a processed boolean flag that defaults to false (0). What I'd like to do is have a PHP script run on cron every minute. Its code would be something like this:

<?php
process();

function process()
{
   // Note: $this->db assumes this runs inside a class with a db wrapper
   $sql = "SELECT id FROM items WHERE processed = '0' ORDER BY id ASC LIMIT 1";
   $result = $this->db->query($sql);

   if (! $result->has_rows())
     die;

   $id = $result->getSingle('id');
   processItem($id); // Will set processed to 1 after processing is done
   process();
}
?>

It should be pretty clear what the above code does: it gets the id of the next unprocessed record, processes it, and then calls process() again, repeating until there are no more items to process, at which point execution stops.

By putting this script on cron to run every minute, I hope to have multiple instances all working at processing the items simultaneously, so that rather than one item at a time, 5-10+ items could be processed at once.

1) Is this going to work the way I'm planning? Any suggestions for improvements or things to watch out for?

2) Should the script keep a counter of the number of running instances? Whenever a cron job starts, it would check the counter and, if 50 (?) instances are already running, exit without processing. That might keep the server from crashing by having too many running processes using up too much memory. Any thoughts?
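One minimal way to enforce such a cap uses non-blocking file locks rather than a counter, which avoids the problem of a crashed script leaving the count wrong. This is a sketch only; the lock-file names and the limit of 50 are illustrative assumptions:

```php
<?php
// Sketch: allow at most $max concurrent instances of this script.
// Each instance tries to grab one of $max non-blocking file locks;
// if none is free, it exits immediately. The lock-file names and
// the limit of 50 are illustrative assumptions.
$max = 50;
$got_lock = false;
$dir = sys_get_temp_dir();

for ($i = 0; $i < $max; $i++) {
    $fp = fopen("$dir/item_processor_$i.lock", "c");
    if ($fp && flock($fp, LOCK_EX | LOCK_NB)) {
        $got_lock = true; // slot $i is ours; the lock is released when the script exits
        break;
    }
    if ($fp) {
        fclose($fp);
    }
}

if (! $got_lock) {
    die("All $max slots busy, exiting\n");
}

// ... run the normal processing loop here ...
?>
```

Because the OS releases flock() locks automatically when a process dies, a crashed instance can never permanently occupy a slot, unlike a counter stored in a file or database.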

+6  A: 

I have a couple of things to say:

Firstly, you are using recursion to process multiple rows. This can lead to problems if you recurse too deeply. Use a simple loop instead.

Secondly, do you know whether this code actually benefits from being run multiple times? If the machine is CPU-bound, it might not benefit from another process. I suggest you test manually how many concurrent instances work best: more is not always faster, and in some cases it can actually slow everything down.

Finally, I would certainly put a limit on how many of these scripts can run concurrently. This can be achieved simply by ensuring each script runs for no longer than, say, 5 minutes. Alternatively, you can keep a count of active scripts and ensure it doesn't exceed the maximum you determined in my second suggestion.
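The runtime-cap idea can be sketched like this. The queue array is a stand-in for the database of unprocessed rows; real code would query MySQL and call processItem() as in the question:

```php
<?php
// Sketch of the 5-minute runtime cap: stop processing after a fixed
// wall-clock budget so instances started by cron never pile up.
// The queue array stands in for the unprocessed rows in the database.
$start = time();
$max_runtime = 5 * 60; // 5 minutes, matching the suggestion above

$queue = array(1, 2, 3); // stand-in for unprocessed ids
$done  = array();

while (time() - $start < $max_runtime) {
    $id = array_shift($queue); // stand-in for "SELECT next unprocessed id"
    if ($id === null) {
        break; // nothing left to process
    }
    $done[] = $id; // stand-in for processItem($id)
}
?>
```

With this cap in place, the number of live instances can never exceed the cap in minutes (one new instance per minute, each dying within 5 minutes).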

Edit: I've added some more information about the problems recursion can cause. Each time you recursively call a function, extra space is used up on the stack. This space stores the local variables as well as the return address (allowing the caller's state to be restored when the called function exits). The stack has only a finite amount of space, so eventually your program will crash with a stack overflow. Try running this simple program:

function a($i) { 
   print $i . "\n"; 
   a($i + 1);
}
a(0);

On my system it crashes PHP after 608739 iterations. That number could be much smaller in a more complex function. A simple loop has none of this overhead, so it does not have this problem.

bramp
Thanks for the answer. What problems can recursion cause?
Click Upvote
I've answered your question about what problems recursion can cause in the main answer body.
bramp
Thanks mate, especially the tip about checking if a script has been running for 5 mins and terminating if it has been. Appreciate your help :)
Click Upvote
+1  A: 

The recursion doesn't seem necessary at all and, as bramp said, could lead to problems. Why not just:

$sql = "SELECT id FROM items WHERE processed = '0' ORDER BY id ASC LIMIT 1";

while ( ($result = $this->db->query($sql)) && $result->has_rows() ) {
   processItem( $result->getSingle('id') ); // sets processed to 1, so the next query returns a new row
}

However, I foresee larger problems here. If you are going to be running this script every minute, what mechanism do you have in place to stop it overlapping with previously started instances that may still be running? You may end up processing the same ID more than once.
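One way to avoid double-processing is to claim a row atomically before working on it, using MySQL's LAST_INSERT_ID(expr) idiom so each connection learns which row its own UPDATE matched. This is a sketch under assumptions: processed = 2 meaning "claimed" is an invented convention, and $db stands in for the question's database wrapper:

```php
<?php
// Sketch: claim one unprocessed row atomically. Only one instance's
// UPDATE can match a given row, so two scripts never grab the same id.
// Assumptions: processed = 2 means "claimed by a worker", and $db is
// the same database wrapper used in the question.
$db->query(
    "UPDATE items SET id = LAST_INSERT_ID(id), processed = '2'
     WHERE processed = '0' ORDER BY id ASC LIMIT 1"
);

// LAST_INSERT_ID() is per-connection, so this returns the id that
// *this* script just claimed, even with many instances running.
$result = $db->query("SELECT LAST_INSERT_ID() AS id");
if ($result->has_rows()) {
    processItem($result->getSingle('id')); // sets processed to 1 when done
}
?>
```

The key property is that the UPDATE both selects and marks the row in a single statement, so there is no window between "read the id" and "mark it taken" for another instance to slip through.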

If you absolutely require a (pseudo) multi-threaded approach, I suggest the following:

  1. Grab a range of, or all of, the unprocessed IDs, not just one at a time.
  2. Using the curl_multi_ family of functions, pass subsets of the above results (groups of n IDs) to another script that does the actual processing.

This method gives you greater control over the whole process and avoids the repeated single-row queries otherwise needed to fetch the unprocessed IDs one at a time.
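The curl_multi_ fan-out could look roughly like this. It is a sketch only: worker.php, its URL, and the batch size of 10 are assumptions, and the worker script itself is not shown:

```php
<?php
// Sketch: fan out batches of ids to a worker script in parallel.
// worker.php, its URL, and the batch size of 10 are illustrative.
$ids = array(/* unprocessed ids fetched in one query */);
$mh  = curl_multi_init();
$handles = array();

foreach (array_chunk($ids, 10) as $batch) {
    $ch = curl_init('http://localhost/worker.php?ids=' . implode(',', $batch));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Drive all transfers until every worker has finished.
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // avoid busy-waiting between exec calls
} while ($running > 0);

foreach ($handles as $ch) {
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
?>
```

Since each batch carries distinct IDs handed out by the parent script, the workers cannot collide on the same row, which also solves the double-processing concern above.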

Justin Johnson