tags: 
views: 89
answers: 4

Hi

I wonder if anyone can help me out with a little cron issue I am experiencing.

The problem is that the load can spike up to 5 and the CPU usage can jump to 40% on a dual-core 'Xeon L5410 @ 2.33GHz' with 356MB RAM, and I'm not sure where, or in which direction, I should be tweaking the code to prevent that. Code sample below:

//Note: $productFile can be 40MB .gz compressed, 700MB uncompressed (XML text file)
if (file_exists($productFile)) {

  $new_page  = "";
  $fResponse = gzopen($productFile, "r");
  if ($fResponse) {

     while (!gzeof($fResponse)) {

        //Read roughly $chunkSize bytes of decompressed data
        $sResponse = "";
        $chunkSize = 10000;
        while (!gzeof($fResponse) && (strlen($sResponse) < $chunkSize)) {
          $sResponse .= gzgets($fResponse, 4096);
        }
        $new_page  .= $sResponse;
        $sResponse  = "";
        $thisOffset = 0;
        unset($matches);

        if (strlen($new_page) > 0) {

           //Empty the buffer if it cannot contain a <product> element
           if (!(strstr($new_page, "<product "))) {
              $new_page = "";
           }

           while (preg_match("/<product [^>]*>.*<\/product>/Uis", $new_page, $matches, PREG_OFFSET_CAPTURE, $thisOffset)) {

              $thisOffset = $matches[0][1];
              $thisLength = strlen($matches[0][0]);
              $thisOffset = $thisOffset + $thisLength;

              //Discard everything up to the end of the matched element
              $new_page   = substr($new_page, $thisOffset - 1);
              $thisOffset = 0;

              $new_page_match = $matches[0][0];

              //- Save collected data here -//

           }//End while preg_match

        }//End if strlen($new_page) > 0

     }//End while !gzeof

     gzclose($fResponse);
  }

}

$chunkSize - should it be as small as possible to keep memory usage down and ease the load on the regular expression, or should it be larger so the code doesn't take too long to run?

With 40,000 matches the load/CPU spikes. So, does anyone have any advice on how to manage large feed uploads via cron jobs?

Thanks in advance for your help

A: 

Since you said you're using LAMP, may I suggest an answer to one of my own questions: http://stackoverflow.com/questions/752337/suggestions-tricks-for-throttling-a-php-script/752441#752441

He suggests using the nice command on the offending script to lower the chances of it bogging down the server.

An alternative would be to profile the script and see where the bottlenecks are. I would recommend xdebug and kcachegrind or webcachegrind. There are countless questions and websites available to help you set up script profiling.
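
If editing the crontab entry to add nice is awkward, roughly the same effect can be had from inside the script with PHP's proc_nice() (a sketch, not from the linked answer; it assumes a Unix host where the function is enabled):

// Sketch: lower this cron script's own priority so it yields CPU to other
// workloads. proc_nice() is only available on Unix-like systems.
if (function_exists('proc_nice')) {
    proc_nice(19); // 19 = lowest priority
}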

Mike B
A: 

You may also want to look at PHP's SAX event-based XML parser - http://uk3.php.net/manual/en/book.xml.php

This is good for parsing large XML files (we use it for XML files of a similar size to the ones you are dealing with) and it does a pretty good job. No need for regexes then :)

You'd need to uncompress the file first before processing it.
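
As a very rough sketch of that approach (the handler names and $uncompressedFile are placeholders, not anything from the original script), the file is fed to the parser in small chunks, so only a tiny window of the 700MB document is in memory at any time:

function startElement($parser, $name, $attrs) { /* e.g. note when a <product> opens */ }
function endElement($parser, $name)           { /* e.g. save the collected product on </product> */ }
function characterData($parser, $data)        { /* text between the tags */ }

$parser = xml_parser_create();
xml_set_element_handler($parser, 'startElement', 'endElement');
xml_set_character_data_handler($parser, 'characterData');

$fp = fopen($uncompressedFile, 'r'); // the feed, already uncompressed to disk
while (!feof($fp)) {
    xml_parse($parser, fread($fp, 8192), feof($fp)); // 8KB at a time
}
fclose($fp);
xml_parser_free($parser);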

simonrjones
+1  A: 

You have at least two problems. The first is that you're trying to decompress the entire 700 MB file into memory. In fact, you're doing this twice.

while (!gzeof($fResponse) && (strlen($sResponse) < $chunkSize)) {
    $sResponse .= gzgets($fResponse, 4096);
}
$new_page .= $sResponse;

Both $sResponse and $new_page will hold strings that will eventually contain the entire 700 MB file. So that's 1.4 GB of memory you're eating up by the time the script finishes running, not to mention the cost of the string concatenation itself (while PHP handles strings better than many other languages, there are limits to what mutable vs. immutable strings will get you).

The second problem is that you're running a regular expression over the increasingly large string in $new_page, which puts more and more load on the server as $new_page grows.

The easiest way to solve your problems is to split up the tasks.

  1. Decompress the entire file to disk before doing any processing.

  2. Use a stream-based XML parser, such as XMLReader or the older SAX event-based parser (see the sketch after this list).

  3. Even with a stream/event-based parser, storing the results in memory may still eat up a lot of RAM. In that case you'll want to take each match and store it on disk or in a database.
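
A rough sketch of steps 2 and 3 with XMLReader ($uncompressedFile stands in for the file produced by step 1; the variable names are illustrative only):

$reader = new XMLReader();
$reader->open($uncompressedFile);

while ($reader->read()) {
    // Stop on each opening <product ...> tag and pull out just that element
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'product') {
        $productXml = $reader->readOuterXml();
        // Step 3: write $productXml to disk or a database here instead of
        // accumulating it in memory
    }
}
$reader->close();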

Alan Storm
A: 

Re Alan: the script should never hold 700MB in memory, because $sResponse is cleared immediately after it reaches $chunkSize and has been appended to $new_page,

$new_page .= $sResponse;
$sResponse = "";

and the length of $new_page is reduced each time a match is found, and it is cleared completely if no match is possible, for each $chunkSize chunk of data.

$new_page   = substr($new_page, $thisOffset-1);

if (!(strstr($new_page, "<product "))) {
   $new_page = "";
}

That said, I can't say I can see where the actual problem lies either.
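
One way to settle it would be to log peak memory from inside the chunk loop and watch whether it climbs (a sketch; $chunkCount is a made-up counter and memory_get_peak_usage() needs PHP 5.2+):

$chunkCount++;
if ($chunkCount % 1000 == 0) {
    // If the buffer really were accumulating, this figure would keep climbing;
    // if not, it should level off after the first few chunks.
    error_log("chunk $chunkCount: peak " . memory_get_peak_usage()
        . " bytes, new_page " . strlen($new_page) . " bytes");
}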

Bill