tags: 
views: 89
answers: 4

Hi

I wonder if anyone can help me out with a little cron issue I am experiencing.

The problem is that the load can spike up to 5 and the CPU usage can jump to 40% on a dual-core 'Xeon L5410 @ 2.33GHz' with 356MB RAM, and I'm not sure where, or in which direction, I should be tweaking the code to prevent that. Code sample below:

//Note: $productFile can be 40MB .gz compressed, 700MB uncompressed (XML text file)
if (file_exists($productFile)) {

  $new_page  = "";
  $fResponse = gzopen($productFile, "r");
  if ($fResponse) {

     while (!gzeof($fResponse)) {

        //Read roughly $chunkSize bytes of decompressed data
        $sResponse = "";
        $chunkSize = 10000;
        while (!gzeof($fResponse) && (strlen($sResponse) < $chunkSize)) {
          $sResponse .= gzgets($fResponse, 4096);
        }
        $new_page  .= $sResponse;
        $sResponse  = "";
        $thisOffset = 0;
        unset($matches);

        if (strlen($new_page) > 0) {

           //Empty the buffer if it cannot contain a <product> element
           if (!(strstr($new_page, "<product "))) {
              $new_page = "";
           }

           while (preg_match("/<product [^>]*>.*<\/product>/Uis", $new_page, $matches, PREG_OFFSET_CAPTURE, $thisOffset)) {

              $thisOffset = $matches[0][1];
              $thisLength = strlen($matches[0][0]);
              $thisOffset = $thisOffset + $thisLength;

              //Discard everything up to the end of the matched element
              $new_page   = substr($new_page, $thisOffset - 1);
              $thisOffset = 0;

              $new_page_match = $matches[0][0];

              //- Save collected data here -//

           }//End while preg_match

        }//End if strlen($new_page) > 0

     }//End while !gzeof

     gzclose($fResponse);
  }

}

$chunkSize - should it be as small as possible to keep memory usage down and ease the load on the regular expression, or should it be larger so the code doesn't take too long to run?

With 40,000 matches the load/CPU spikes. So, does anyone have any advice on how to manage large feed uploads via cron jobs?

Thanks in advance for your help

A: 

Since you said you're using LAMP, may I suggest an answer to one of my own questions: http://stackoverflow.com/questions/752337/suggestions-tricks-for-throttling-a-php-script/752441#752441

He suggests using the nice command on the offending script to lower the chances of it bogging down the server.

An alternative would be to profile the script and see where the bottlenecks are. I would recommend xdebug and kcachegrind or webcachegrind. There are countless questions and websites available to help you set up script profiling.
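
If editing the crontab entry to add nice is awkward, roughly the same effect can be had from inside the script with PHP's proc_nice() (a sketch, not from the linked answer; it assumes a Unix host where the function is enabled):

// Sketch: lower this cron script's own priority so it yields CPU to other
// workloads. proc_nice() is only available on Unix-like systems.
if (function_exists('proc_nice')) {
    proc_nice(19); // 19 = lowest priority
}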

Mike B
A: 

You may also want to look at PHP's SAX event-based XML parser - http://uk3.php.net/manual/en/book.xml.php

This is good for parsing large XML files (we use it for XML files of a similar size to the ones you are dealing with) and it does a pretty good job. No need for regexes then :)

You'd need to uncompress the file first before processing it.
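
As a very rough sketch of that approach (the handler names and $uncompressedFile are placeholders, not anything from the original script), the file is fed to the parser in small chunks, so only a tiny window of the 700MB document is in memory at any time:

function startElement($parser, $name, $attrs) { /* e.g. note when a <product> opens */ }
function endElement($parser, $name)           { /* e.g. save the collected product on </product> */ }
function characterData($parser, $data)        { /* text between the tags */ }

$parser = xml_parser_create();
xml_set_element_handler($parser, 'startElement', 'endElement');
xml_set_character_data_handler($parser, 'characterData');

$fp = fopen($uncompressedFile, 'r'); // the feed, already uncompressed to disk
while (!feof($fp)) {
    xml_parse($parser, fread($fp, 8192), feof($fp)); // 8KB at a time
}
fclose($fp);
xml_parser_free($parser);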

simonrjones
+1  A: 

You have at least two problems. The first is that you're trying to decompress the entire 700 MB file into memory. In fact, you're doing this twice.

while (!gzeof($fResponse) && (strlen($sResponse) < $chunkSize)) {
    $sResponse .= gzgets($fResponse, 4096);
}
$new_page .= $sResponse;

Both $sResponse and $new_page will hold strings that will eventually contain the entire 700 MB file. So that's 1.4 GB of memory you're eating up by the time the script finishes running, not to mention the cost of the string concatenation itself (while PHP handles strings better than many other languages, there are limits to what mutable vs. immutable strings will get you).

The second problem is that you're running a regular expression over the increasingly large string in $new_page, which puts more and more load on the server as $new_page grows.

The easiest way to solve your problems is to split up the tasks.

  1. Decompress the entire file to disk before doing any processing.

  2. Use a stream-based XML parser, such as XMLReader or the older SAX event-based parser (see the sketch after this list).

  3. Even with a stream/event-based parser, storing the results in memory may still eat up a lot of RAM. In that case you'll want to take each match and store it on disk or in a database.
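
A rough sketch of steps 2 and 3 with XMLReader ($uncompressedFile stands in for the file produced by step 1; the variable names are illustrative only):

$reader = new XMLReader();
$reader->open($uncompressedFile);

while ($reader->read()) {
    // Stop on each opening <product ...> tag and pull out just that element
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'product') {
        $productXml = $reader->readOuterXml();
        // Step 3: write $productXml to disk or a database here instead of
        // accumulating it in memory
    }
}
$reader->close();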

Alan Storm
A: 

Re Alan: the script should never hold 700MB in memory, because $sResponse is cleared immediately after it reaches $chunkSize and has been appended to $new_page,

$new_page .= $sResponse;
$sResponse = "";

and the length of $new_page is reduced each time a match is found, and it is cleared completely if no match is possible, for each $chunkSize chunk of data.

$new_page   = substr($new_page, $thisOffset-1);

if (!(strstr($new_page, "<product "))) {
   $new_page = "";
}

That said, I can't say I can see where the actual problem lies either.
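
One way to settle it would be to log peak memory from inside the chunk loop and watch whether it climbs (a sketch; $chunkCount is a made-up counter and memory_get_peak_usage() needs PHP 5.2+):

$chunkCount++;
if ($chunkCount % 1000 == 0) {
    // If the buffer really were accumulating, this figure would keep climbing;
    // if not, it should level off after the first few chunks.
    error_log("chunk $chunkCount: peak " . memory_get_peak_usage()
        . " bytes, new_page " . strlen($new_page) . " bytes");
}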

Bill