I have a PHP script that needs to run for quite some time.

What the script does:

  • connects to MySQL
  • initiates anywhere from 100 to 100,000 cURL requests
  • each cURL request returns compact-decoded data of 1 to 2,000 real estate listings - I use preg_match_all to get all the data and do one MySQL INSERT per listing. Each query never exceeds 1 MB of data.

So there are a lot of loops, MySQL inserts, and cURL requests going on. PHP safe mode is off, and I am able to successfully ini_set max_execution_time to something ridiculous to allow my script to run all the way through.
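
Roughly, the structure is something like this (heavily simplified; the credentials, the pattern, and the table name are placeholders, not my real code):

// Heavily simplified sketch; credentials, pattern, and table are placeholders.
ini_set('max_execution_time', 86400);

$db   = mysqli_connect('localhost', 'user', 'pass', 'mls');
$urls = [/* 100 to 100,000 feed URLs */];

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    curl_close($ch);

    preg_match_all('/<Listing>(.*?)<\/Listing>/s', $response, $matches);
    foreach ($matches[1] as $listing) {
        mysqli_query($db, "INSERT INTO listings (raw_data) VALUES ('"
            . mysqli_real_escape_string($db, $listing) . "')");
    }
}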

Well, my problem is that the script, or Apache, or something is having a stroke in the middle of the run, and the browser ends up on the "connection to the server has been reset" screen.

Any ideas?

A: 

What's in the Apache error_log? Are you reaching the memory limit?

EDIT: Looks like you are reaching your memory limit. Do you have access to php.ini? If so, you can raise memory_limit there. If not, try running the curl or wget binaries using the exec or shell_exec functions; that way they run as separate processes and don't use PHP's memory.
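
For example, something along these lines (the URL and temp-file handling are just a sketch) keeps the download itself in a separate process:

// Sketch: let the wget binary do the download in its own process,
// then read the finished file back in. URL and paths are placeholders.
$url = 'http://example.com/listing/123';
$tmp = tempnam(sys_get_temp_dir(), 'listing_');
shell_exec('wget -q -O ' . escapeshellarg($tmp) . ' ' . escapeshellarg($url));
$html = file_get_contents($tmp);
unlink($tmp);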

Josh
Yes. I'm a noob, sorry: Allowed memory size of 100663296 bytes exhausted (tried to allocate 2975389 bytes)
John
This might sound even more noobish, but can't I just ob_flush/flush throughout the script, or at certain parts of it?
John
@John: No, the buffer is only one part of the memory being used. The cURL functions use quite a bit of memory all to themselves.
Nathan Kleyn
Well, why does it keep every request in memory? Wouldn't it get rid of the old request once it starts a new one?
John
@John: No. Because cURL is an outside library, its memory is very tricky for PHP's memory management model to dispose of correctly. Often this means that if the cURL calls are not enclosed inside a block (a class, or even a function), they will not be disposed of correctly.
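For example, something as simple as this (just a sketch) keeps each handle scoped to a single call:
// Sketch: keep the handle local to a function so it is released after each request.
function fetch_page($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $data = curl_exec($ch);
    curl_close($ch);   // free the handle's resources immediately
    return $data;
}
$html = fetch_page('http://example.com/listing/123');   // placeholder URL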
Nathan Kleyn
@John: See my edited answer for advice on how to raise your memory limit, or bring down your memory usage.
Josh
Josh, thanks for all the advice. It took me forever to get this cURL stuff working. It has to use cookies to keep me logged in, and I'm not sure if I can do that with wget, or how I would even go about it. With my limited cURL knowledge I just thought it would make sense to re-use the cURL handle, but the fact that it doesn't release the memory until the very end is messing me up :(
John
A: 

100,000 cURL requests??? You are insane. Break that data up!

Byron Whitlock
Every time the client adds a new MLS it has to get anywhere from 1,000 to 10,000 listings. I can get all of the listings in about 5 cURL requests, but I have to do 1 cURL request per listing to get the images for it.
John
@John: What about writing a class, then, that contains the functionality to retrieve one item at a time? You could loop over all the listings and instantiate the class once for each one, ensuring in the process that when the object is destroyed the cURL memory gets freed too.
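Just to sketch the idea (class and property names are made up):
// Sketch: one object per listing; the destructor closes the cURL handle.
class ListingFetcher {
    private $ch;

    public function __construct($url) {
        $this->ch = curl_init($url);
        curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, true);
    }

    public function fetch() {
        return curl_exec($this->ch);
    }

    public function __destruct() {
        curl_close($this->ch);   // released when the object goes out of scope
    }
}

$listingUrls = [/* one URL per listing */];
foreach ($listingUrls as $url) {
    $fetcher = new ListingFetcher($url);
    $page = $fetcher->fetch();
    // ... preg_match_all and INSERT here ...
    unset($fetcher);             // free the handle each iteration
}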
Nathan Kleyn
@John: Basically, you just want to make sure that you're not retrieving the same data over and over, wasting cycles and bandwidth in the process. By setting up a job queue of some description, and storing each retrieved page in a database, you can prevent this easily.
Nathan Kleyn
That would make the script run even longer, because right now it does one cURL request to log in, then it does all the cURL requests in a loop, then it does a cURL request to log out. So instead of login (loop 20 times) logout // 22 cURL requests, it's going to do 20*3 // 60 cURL requests. Your suggestion would definitely help with the memory problem, though :( There's got to be a way to free up the part of the memory I don't need anymore, isn't there? After it does one thing, I don't know why PHP tries to remember it until the end; it seems excessive.
John
I never retrieve the same data twice.
John
@John: The problem with the way you're doing it now is that, although it doesn't take as long as the method we're suggesting, it does everything at once and kills the server in the process. What we're proposing is to slow the rate down to roughly one request at a time, and wrap each cURL call in its own class instance so the memory gets cleaned up after each request.
Nathan Kleyn
Yeah, I ini_set the memory to something ridiculous and made my script email me memory_get_usage() every time it made a cURL request so I could see where it's dying, but even with the ini_set memory change the script still dies. I'm hard-headed, sorry, but it looks like I'm going to have to take the "Nathan" route on this one :(
John
+2  A: 

Lots of ideas:

1) Don't do it inside an HTTP request. Write a command-line PHP script to drive it. You can use a web-bound script to kick it off, if necessary.

2) You should be able to set max_execution_time to zero (or call set_time_limit(0)) to ensure you don't get shut down for exceeding a time limit.

3) It sounds like you really want to refactor this into something more sane. Think about setting up a little job-queueing system, and having a PHP script that forks several children to chew through all the work.
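
A rough shape of that, assuming the pcntl extension on the CLI (get_next_job and process_job are hypothetical stand-ins for your own queue logic):

// Rough sketch of a command-line worker that forks children to share the work.
// Requires the pcntl extension; get_next_job()/process_job() are hypothetical.
set_time_limit(0);

$workers = 4;
for ($i = 0; $i < $workers; $i++) {
    $pid = pcntl_fork();
    if ($pid === 0) {                        // child process
        while ($job = get_next_job()) {      // e.g. pull one listing URL from a DB table
            process_job($job);               // cURL fetch + preg_match_all + INSERT
        }
        exit(0);
    }
}
while (pcntl_waitpid(0, $status) !== -1);    // parent waits for all children to finish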

As Josh says, look at your error_log and see why you're being shut down right now. Try to figure out how much memory you're using -- that could be a problem. Try setting the max_execution_time to zero. Maybe that will get you where you need to be quickly.

But in the long run, it sounds like you've got way too much work to do inside of one HTTP request. Take it out of HTTP, and divide and conquer!

timdev
Didn't know about that 0 trick, good to know. Not sure how to go about doing this outside of having it in a PHP script.
John
+3  A: 

Well, disregarding the fact that attempting 100,000 cURL requests is absolutely insane, you're probably hitting the memory limit.

Try setting the memory limit to something more reasonable:

ini_set('memory_limit', '256M');

And as a side tip, don't set the execution time to something ludicrous; chances are you'll eventually find a way to hit that with a script like this. ;]

Instead, just set it to 0; it is functionally equivalent to turning the execution limit off completely:

ini_set('max_execution_time', 0);
Nathan Kleyn
Yes, I see now that I need to increase the memory limit, but is this a bad idea?
John
@John: Yes and no. Don't leave it set higher than you need all the time, as the limit is what stops a script with an error from running forever. Imagine if you turned off the execution time limit and the memory limit and accidentally ran a script with an infinite loop! Moral of the story: use it sparingly, for situations like this where nothing else would really work short of writing it to be distributed or executed over time. By the way, I second timdev's comment about setting up a job-queueing system; that really is the way to do this.
Nathan Kleyn
This was a better answer than mine -- I forgot you could override `memory_limit` using `ini_set`
Josh
+1  A: 

You can set the timeout to be indefinite by modifying your php.ini and setting the max_execution_time variable.

But you may also want to consider a slight architecture change. First, consider a "launch and forget" approach to issuing those 100,000 requests. Second, consider using "wget" instead of cURL.

You can issue a simple "wget URL -O UniqueFileName &". This will retrieve a web page, save it to a "unique" filename, and do it all in the background.

Then you can iterate over a directory of files, grepping (preg_match-ing) the data and making your DB calls. Move each file to an archive as you process it, and continue iterating until there are no more files.

Think of the directory as a "queue" and have one process just process the files. Have a second process simply go out and grab the web-page data. You could add a third process that acts as your "monitor", working independently and simply reporting snapshot statistics. The other two can just be "web services" with no interface.
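
The processing side might look something like this (the directory paths, the pattern, and the INSERT are placeholders):

// Sketch of the "processor": treat a directory of downloaded pages as the queue.
// Paths, the regex, and the database call are placeholders.
foreach (glob('/var/spool/listings/*.html') as $file) {
    $html = file_get_contents($file);
    preg_match_all('/<Listing>(.*?)<\/Listing>/s', $html, $matches);
    foreach ($matches[1] as $listing) {
        // ... one MySQL INSERT per listing here ...
    }
    rename($file, '/var/spool/archive/' . basename($file));   // move to the archive
}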

This type of multi-threading is really powerful and greatly under-utilized IMHO. To me this is the true power of the web.

ChronoFish