I have this big function (1300+ lines of code) that takes data from the web and inserts it into a local database. Each run takes about 20 seconds to complete, and I need to run the function about a million times, so I use set_time_limit(0) to remove the PHP time limit and loop the function a million times, like this:

for ($ID= '01'; $ID < '999999'; $ID++) {
    getDataFromWeb($conn, $ID);
}

So what's the problem? There are a million things that can go wrong, and something always does. Suddenly the code gets stuck at ID 23465, for example, and it just stops getting data, but I don't get any kind of error. It's as if the loop continues without inserting anything into the database, and because I removed the PHP time limit, it never stops.

I want to know how I can detect this kind of problem, stop everything and show an alert. Suppose I record the time before the function starts and check it again when the function ends, like this:

for ($ID = '01'; $ID < '999999'; $ID++) {
    $time_start = microtime(true); // true returns the time as a float, in seconds
    getDataFromWeb($conn, $ID);
    $time_end = microtime(true);
    // ... somehow check how long it took and stop if it is taking too much
    if ($time_end - $time_start > $time_alert) {
        break;
    }
}

It will not work because if the function never completes then $time_end will never be set and so on...

So, help please?

A: 

If getDataFromWeb($conn, $ID); uses a library like libcurl or similar, then maybe it's a good idea to set a connection time limit there? Or, for debugging, just echo '.' after each call so you know the function has finished and returned.
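
As a rough sketch of that idea, assuming the fetching is done with PHP's cURL extension (the question does not confirm this), the request can be given a hard time limit and a failure can be turned into an exception that the loop catches and records. The names fetchWithTimeout and runAllIds are hypothetical, as are the default limits of 10 and 30 seconds:

```php
<?php
// Hypothetical helper: fetch a URL with hard time limits, throw on any failure.
function fetchWithTimeout(string $url, int $connectTimeout = 10, int $totalTimeout = 30): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_CONNECTTIMEOUT => $connectTimeout, // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $totalTimeout,   // seconds allowed for the whole request
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        // Timeouts land here too; curl_error() gives the reason as text.
        throw new RuntimeException("Fetch failed for $url: " . curl_error($ch));
    }
    return $body;
}

// Hypothetical loop: run $work for every ID, record failures, keep going.
function runAllIds(callable $work, array $ids): array
{
    $failed = [];
    foreach ($ids as $id) {
        try {
            $work($id);
        } catch (RuntimeException $e) {
            $failed[$id] = $e->getMessage(); // could be written to an error table instead
        }
    }
    return $failed;
}
```

With something like this, a request that hangs is cut off after the limit, the ID ends up in the list of failures, and the loop moves on to the next ID instead of stalling overnight.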

Gtx
If I put a connection time limit then I need to manually run the function all the time instead of just running the code overnight...
Jonathan
No, you won't. It will just break the connection that ran over time. If possible, you could throw and catch an exception when the socket/API class returns a "time out", and push it to the database as an error. The whole script would then continue running the loop...
Gtx
+1  A: 

I'd try http://www.php.net/manual/en/function.set-time-limit.php#92949.

ghaxx
The only thing the function returns each time it runs are some PHP notices (not errors, just simple notices that do no harm at all). Are those notices 'output' or not?
Jonathan
They're not output, but they can be captured, too. Putting @ before function name should enable you to capture warning messages from ob_start() forth: ob_start(); @func(); ob_end_clean();
ghaxx
A: 

Okay - there are several things here that are red flags in my mind.

First - You weren't kidding when you said you were looping this 1 million times. That surprised me.

Second - This loop looks weird to me:

for ($ID= '01'; $ID < '999999'; $ID++)

Why not instead do:

for ($ID = 1; $ID < 999999; $ID++)

I don't see why you're using Strings for Integer counting.

Third - How are you executing this? Is it from a browser or from the CLI?

Lastly - Without seeing the code it's hard to say what's going on, but does the function return a true/false boolean when complete, or are there other triggers, like echo statements (at the minimum), in the function that will print debug information so you can track the progress?

You may want to simplify the code in the getDataFromWeb function. It sounds like it's running some kind of cURL request, parsing that data, and placing it into the "$conn" database. It might be easier to both read and understand if you split the specific tasks from that function into separate functions (or made a class): one for getting the data, one for "cleaning" the data, and one for entering the data into the database. If a function has too many tasks then issues like this (debugging) become a nightmare.
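
A minimal sketch of that split might look like the following. Every name here is hypothetical, since the real function is not shown: fetchRaw is a stub standing in for the web request, and a plain array stands in for the "$conn" database. Each step reports success or failure, and processId echoes a progress line so a stall is visible:

```php
<?php
// Step 1 (stub): get the raw data. Returns null when nothing could be fetched.
function fetchRaw(int $id): ?string
{
    return $id % 2 === 0 ? "  item-$id  " : null; // pretend odd IDs fail
}

// Step 2: clean the raw data.
function cleanData(string $raw): string
{
    return trim($raw);
}

// Step 3: store the cleaned data. The array is a stand-in for the database.
function storeData(array &$db, int $id, string $clean): bool
{
    $db[$id] = $clean;
    return true;
}

// Glue: run the three steps for one ID and say what happened.
function processId(array &$db, int $id): bool
{
    $raw = fetchRaw($id);
    if ($raw === null) {
        echo "ID $id: fetch failed\n";
        return false;
    }
    $ok = storeData($db, $id, cleanData($raw));
    echo "ID $id: " . ($ok ? "stored" : "store failed") . "\n";
    return $ok;
}
```

With the work divided this way, the progress line shows which ID and which step a problem belongs to, and each step can be tested on its own.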

Marco Ceppi
I changed the 'string' to an integer. The function can return a number from 0 to 9 depending on the amount of data fetched. I'm executing this from a localhost browser. I don't know if it's cURL because I'm using an API I don't really understand, but I know that it uses a 'browser emulator'.
Jonathan
A: 

Do you have any mysql_error()/mysql_errno() functions in your getDataFromWeb() function? Such as

if(mysql_errno($conn))
{ 
  echo mysql_errno($conn) . ": " . mysql_error($conn);
}

From http://php.net/manual/en/function.mysql-error.php

To stop the function, replace the echo with die.

Nick Pyett
A: 

Side note: The supplied code will not loop 1,000,000 times. The following will:

for( $id=1 ; $id<=1000000 ; $id++ ) {
    getDataFromWeb( $conn , $id );
}

Also, with regards to your need to have this script run constantly to load content into a database, I would suggest the following:

  • I presume that you are using an SQL Table to hold the URLs to be crawled,
  • Add a field with a timestamp called 'loadAttempted',
  • Limit the PHP Script to process only a few URLs per run, maybe 5,
  • Record the time the Script attempts to crawl the URL in the 'loadAttempted' field,
  • Have each run of the Script search for any URLs where 'loadAttempted' is empty, or where it is more than X minutes ago,
  • Add a CRON Job to trigger the Script

This would mean that, up to every minute, the script will be triggered and will try and load 5 URLs. If a URL takes an abnormally long period of time to load (which would mean that the script timed out whilst trying to crawl it) it will cycle back around and be tried again.
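
The selection step described above could be sketched roughly as follows, using PDO. The table name urls, its id and url columns, the function name claimBatch, and storing 'loadAttempted' as a Unix timestamp are all assumptions made for illustration:

```php
<?php
// Hypothetical helper: pick up to $limit URLs that were never attempted, or whose
// last attempt is older than $retryAfter seconds, and stamp them with $now.
function claimBatch(PDO $pdo, int $limit, int $retryAfter, int $now): array
{
    $select = $pdo->prepare(
        'SELECT id, url FROM urls
         WHERE loadAttempted IS NULL OR loadAttempted < :cutoff
         ORDER BY loadAttempted ASC
         LIMIT :limit'
    );
    $select->bindValue(':cutoff', $now - $retryAfter, PDO::PARAM_INT);
    $select->bindValue(':limit', $limit, PDO::PARAM_INT);
    $select->execute();
    $rows = $select->fetchAll(PDO::FETCH_ASSOC);

    // Record the attempt time before crawling, so a hung run is retried later.
    $mark = $pdo->prepare('UPDATE urls SET loadAttempted = :now WHERE id = :id');
    foreach ($rows as $row) {
        $mark->execute([':now' => $now, ':id' => $row['id']]);
    }
    return $rows;
}
```

Each triggered run would call claimBatch($pdo, 5, $retryAfter, time()) and pass the returned URLs to the crawl function. This sketch only tracks attempts, so a real table would probably also need some way to mark rows that finished successfully; otherwise they would be picked up again after the retry interval.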

You could also use this, or variants on the idea, to get stats for pages which are slower than the rest and/or the average loadtime for the URLs.

Also, if you want to have this running constantly, I would suggest limiting the PHP script to run the getDataFromWeb() function only a small number of times per run (like 5).

Lucanos
SQL tables are used. I have a benchmark table with all the load times. How to limit the script to try to run the function 5 times inside the loop? No idea how to use a CRON, help?
Jonathan
Within the SQL Query that fetches a group of **X** URLs to crawl, you would include `... ORDER BY loadAttempted ASC LIMIT X` (the column name must not be wrapped in single quotes, or MySQL sorts by a constant string instead of the column). This means that all the URLs are sorted by the last time they were loaded (with the ones loaded most recently at the bottom, and the least recently at the top), and the top **X** rows are then skimmed off the results. CRON Usage - A couple of tutorials: http://webhostingrating.com/hosting-guide/cpanel/creating-cron-jobs-in-cpanel/ http://www.upstartblogger.com/how-to-create-a-cron-job-in-cpanel
Lucanos