I'm trying to read some large text files (between 50 MB and 200 MB) and do a simple text replacement (essentially, the XML I have hasn't been properly escaped in a few regular cases). Here's a simplified version of the function:

<?php
function cleanFile($file1, $file2) {
  $input_file  = fopen($file1, "r");
  $output_file = fopen($file2, "w");
  while (!feof($input_file)) {
    $buffer = trim(fgets($input_file, 4096));
    if (substr($buffer, 0, 6) == '<text>' AND substr($buffer, 0, 15) != '<text><![CDATA[') {
      $buffer = str_replace('<text>', '<text><![CDATA[', $buffer);
      $buffer = str_replace('</text>', ']]></text>', $buffer);
    }
    fputs($output_file, $buffer . "\n");
  }
  fclose($input_file);
  fclose($output_file);
}
?>

What I don't get is that for the largest files, around 150 MB, PHP memory usage goes off the chart (around 2 GB) before failing. I thought this was the most memory-efficient way to go about reading large files. Is there some method I am missing that would be more memory-efficient? Perhaps some setting that's keeping things in memory when they should be collected?

In other words, it's not working, I don't know why, and as far as I can tell I'm not doing anything incorrectly. Any direction for me to go? Thanks for any input.

+2  A: 

PHP isn't really designed for this. Offload the work to a different process that you call or start from PHP. I suggest using Python or Perl.

Randolpho
unfortunately, it's not an option at this point to choose another language. :(
jacobangel
Then do it with PHP in a separate process. The point is that you shouldn't be parsing that large a file as part of your request. You should offload the work to a separate process, return a response, and then allow a second request to determine whether or not the process id done. Asynchronous FTW.
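For illustration, a minimal sketch of that kick-off-and-poll pattern in PHP alone (the clean_file.php worker script and the .done marker file are conventions I'm inventing here, not something from the question):

<?php
// Kick off the cleanup in a background PHP CLI process and return
// immediately, so the web request doesn't block on the 150 MB file.
function startCleanup($in, $out) {
  $cmd = sprintf('php clean_file.php %s %s > /dev/null 2>&1 &',
                 escapeshellarg($in), escapeshellarg($out));
  exec($cmd);
}

// A later request can poll for a marker file the worker writes on exit.
function cleanupIsDone($out) {
  return file_exists($out . '.done');
}
?>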
Randolpho
Agreed. My guess is that you are receiving the file via FTP, batch process, etc. Why not parse the file as soon as it lands on the file system instead of waiting for someone to pull it down via a web request?
matt eisenberg
heh.. just noticed the typo... I meant "process *is* done" not "process id done". :D
Randolpho
went with this option :)
jacobangel
Glad to hear it! :)
Randolpho
+1  A: 

From my meagre understanding of PHP's garbage collection, the following might help:

  1. unset $buffer when you are done writing it out to disk, explicitly telling the GC to clean it up.
  2. put the if block in another function, so the GC runs when that function exits.

The reasoning behind these recommendations is that I suspect the garbage collector is not freeing up memory because everything is done inside a single function, and the GC is garbage.
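For example, a minimal sketch with both suggestions applied to the loop from the question (cleanLine is a name I've made up):

<?php
function cleanLine($buffer) {
  // Suggestion 2: do the replacement inside its own function, so its
  // locals go out of scope on every call.
  if (substr($buffer, 0, 6) == '<text>' AND substr($buffer, 0, 15) != '<text><![CDATA[') {
    $buffer = str_replace('<text>', '<text><![CDATA[', $buffer);
    $buffer = str_replace('</text>', ']]></text>', $buffer);
  }
  return $buffer;
}

while (!feof($input_file)) {
  $buffer = cleanLine(trim(fgets($input_file, 4096)));
  fputs($output_file, $buffer . "\n");
  // Suggestion 1: explicitly drop the reference once the line is written.
  unset($buffer);
}
?>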

freespace
Tried this. It did free up a bit of memory, but not enough. I wish I knew what precisely it was doing with the memory.
jacobangel
A: 

I expect this to fail in many cases. You are reading in chunks of up to 4096 bytes, and there is no guarantee that the cut-off won't fall in the middle of a <text> element, in which case your str_replace would not work.

Have you considered using a regular expression?
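Something along these lines, for instance (just a sketch; the pattern assumes each <text>...</text> pair sits on one line, and $buffer is the line read from the file):

<?php
// Wrap the body of any <text> element that doesn't already start with
// a CDATA section, in a single pass.
$buffer = preg_replace(
  '#<text>(?!<!\[CDATA\[)(.*?)</text>#',
  '<text><![CDATA[$1]]></text>',
  $buffer
);
?>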

jeyoung