views: 469

answers: 6

Hi everybody,

In a PHP program, I sequentially read a bunch of files (with file_get_contents()), gzdecode() them, json_decode() the result, analyze the contents, throw most of it away, and store about 1% in an array. Unfortunately, with each iteration (I traverse an array containing the filenames), some memory seems to be lost (according to memory_get_peak_usage(), about 2-10 MB each time). I have double- and triple-checked my code: I am not storing unneeded data in the loop (and the needed data hardly exceeds 10 MB overall), but I am frequently rewriting strings in an array. Apparently, PHP does not free the memory correctly, so it uses more and more RAM until it hits the limit. Is there any way to do a forced garbage collection? Or, at least, to find out where the memory is going?
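
Simplified, the loop looks roughly like this ($data->records, $record->value and is_interesting() are placeholders for what my real code does):

foreach ($filenames as $filename) {
  $json = gzdecode(file_get_contents($filename));
  $data = json_decode($json);

  foreach ($data->records as $record) {     // placeholder structure
    if (is_interesting($record)) {          // keeps roughly 1% of the data
      $kept[] = (string) $record->value;
    }
  }

  echo memory_get_peak_usage() . PHP_EOL;   // grows by 2-10 MB per file
}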

Thanks in advance, Dmitri

+1  A: 

Call memory_get_peak_usage() after each statement, and ensure you unset() everything you can. If you are iterating with foreach(), use a reference variable to avoid making a copy of the original array:

foreach( $x as &$y)
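
For example (a sketch only; the variable and function names here are made up):

foreach ($rows as &$row) {
  $row = transform($row);    // work on the original element, not a copy
}
unset($row);                 // break the reference once the loop is done
unset($json, $decoded);      // and drop large intermediates as soon as possible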

If PHP is actually leaking memory, a forced garbage collection won't make any difference.

There's a good article on PHP memory leaks and their detection at IBM.

Andy
Using unset() is a good solution, but you still rely on the GC. You may also try assigning NULL to the variables you no longer need; the memory may be reclaimed faster.
Macmade
The IBM article basically says "use `memory_get_peak_usage` to locate the leaks", which is not very helpful, as I already seem to have located it - however, I have no idea how to get rid of a memory leak in an internal PHP function...
DBa
If it's internal to a PHP function, you can't get rid of it - it's a bug in the language! If you have detected the leak, perhaps you have identified a function you should (a) try to find an equivalent of, or (b) report at http://bugs.php.net/. Perhaps you should post the code you're having trouble with?
Andy
+4  A: 

In PHP >= 5.3.0, you can call gc_collect_cycles() to force a GC pass.
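
For example, inside the loop (a minimal sketch; process_file() is a placeholder, and gc_enable() is only needed if the collector was disabled):

gc_enable();                  // make sure the cycle collector is on
foreach ($filenames as $filename) {
  process_file($filename);    // placeholder for the real work
  $collected = gc_collect_cycles();
  echo "collected $collected cycles, peak " . memory_get_peak_usage() . PHP_EOL;
}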

Mo
Tried - no effect.
DBa
+1  A: 

I've found that PHP's GC is most likely to be invoked upon completion of a function. Knowing that, I've refactored code in a loop like so:

while (condition) {
  // do
  // cool
  // stuff
}

to

while (condition) {
  do_cool_stuff();
}

function do_cool_stuff() {
  // do
  // cool
  // stuff
}

EDIT

I ran this quick benchmark and did not see an increase in memory usage. This leads me to believe the leak is not in json_decode():

for ($x = 0; $x < 10000000; $x++) {
  do_something_cool();
}

function do_something_cool() {
  $json = '{"a":1,"b":2,"c":3,"d":4,"e":5}';
  $result = json_decode($json);
  echo memory_get_peak_usage() . PHP_EOL;
}
Mike B
This reduced the leak, but has not fixed it entirely... Apparently, something is leaking inside the `json_decode` function - is there any alternative implementation? I do not care if it is a bit slower, as long as it does not eat up memory (currently, the program hits the 1 GB mark at 60% of processing, causing the machine to swap and thus becoming _VERY_ slow... There is nothing that would justify such memory use; the chunks read are all about 10 MB and they are processed one after another).
DBa
@DBa I updated my answer
Mike B
Mike, I tried the same and haven't been able to reproduce the leak with a "simple" approach (fuzzing around with a simple array) either. Will try to run it with my input data, maybe that's the problem.
DBa
While trying to reproduce the whole thing, I eventually found the solution: it was a string concatenation. I was generating the output line by line by concatenating some variables (the output is a CSV file). However, PHP seems not to free the memory used for the old copy of the string, effectively clobbering RAM with unused data. Switching to an array-based approach (and imploding it with commas just before fputs-ing it to the outfile) circumvented this behavior.
DBa
A: 

I was going to say that I wouldn't necessarily expect gc_collect_cycles() to solve the problem, since presumably the files are no longer mapped to zvals. But did you check that gc_enable() was called before loading any files?

I've noticed that PHP seems to gobble up memory when doing includes - much more than is required for the source and the tokenized file - this may be a similar problem. I'm not saying that this is a bug, though.

I believe one workaround would be not to use file_get_contents(), but rather fopen()/fgets()/fclose(), so you are not mapping the whole file into memory in one go. But you'd need to try it to confirm.
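
Roughly (an untested sketch; for the gzipped files in the question, prepending 'compress.zlib://' to the filename lets fopen() decompress on the fly):

$fh = fopen($filename, 'rb');
if ($fh !== false) {
  while (($line = fgets($fh)) !== false) {
    // handle one line/chunk at a time instead of holding the whole file
  }
  fclose($fh);
}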

HTH

C.

symcbean
A: 

Found the solution: it was a string concatenation. I was generating the output line by line by concatenating some variables (the output is a CSV file). However, PHP seems not to free the memory used for the old copy of the string, effectively clobbering RAM with unused data. Switching to an array-based approach (and imploding it with commas just before fputs-ing it to the outfile) circumvented this behavior.

For some reason - not obvious to me - PHP reported the increased memory usage during json_decode calls, which misled me into assuming that the json_decode function was the problem.
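
Roughly, the change looked like this (simplified; only two of the fields are shown):

// Before (simplified): one ever-growing output string, written at the end
$outstring .= $rec->id . ',' . $rec->name /* , ... */ . "\n";
// ...
fputs($handle, $outstring);

// After (simplified): per-record array, imploded and written immediately
fputs($handle, implode(',', array($rec->id, $rec->name /* , ... */)) . PHP_EOL);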

DBa
Do you mind giving some more detail about this? It might help me out. You were resetting an existing string variable in each iteration of your loop, but the memory used to hold the old string(s) was not being released - was that the problem? And now, using an array to hold the data, the memory does get freed?
Scott Saunders
Yes, indeed. Actually, I tried two approaches (both bloated memory) with strings: `$outstring .= ($rec->id).','.($rec->name).','. [.....] .'\n'; ... fputs($handle, $outstring);` - `$outstring` is about 50-55 MB at the end, so I would expect it to add 50-70 MB of RAM usage, but I was completely unprepared to see the script hit the 2 GB limit! The second try (building an array of strings and outputting it) was unsuccessful, too. So I now go with: `.... fputs($handle, implode(',', array($rec->id, $rec->name, ...))); fputs($handle, PHP_EOL); ....`, which seems to solve the problem.
DBa
A: 

It has to do with memory fragmentation. Consider two strings concatenated into one: each original must remain until the output is created, and the output is longer than either input, so a new allocation must be made to store the result. The original strings are freed, but they are small blocks of memory.

In a case of `str . str . str . str`, several temporaries are created at each `.` - and none of them fit in the space that has been freed up. The strings are likely not laid out in contiguous memory (that is, each string is, but the various strings are not laid end to end) due to other uses of the memory. So freeing a string creates a problem, because the space can't be reused effectively. You grow with each temporary you create, and you never reuse anything.

Using the array-based implode, you create only one output, exactly the length you require, performing only one additional allocation. So it's much more memory efficient, and it doesn't suffer from concatenation fragmentation. The same is true of Python: if you need to do more than one concatenation, it should always be array-based (`''.join(['str1','str2','str3'])` in Python, `implode('', array('str1', 'str2', 'str3'))` in PHP). sprintf equivalents are also fine.

The memory reported by memory_get_peak_usage() is basically always the "last" bit of memory in the virtual map it had to use. Since that is always growing, it reports rapid growth, as each allocation falls "at the end" of the currently used memory block.
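
A rough way to see the difference yourself (a sketch, not from the thread above; the exact numbers depend heavily on the PHP version and memory limit):

// Usage: php peak.php concat   or   php peak.php implode
$mode = isset($argv[1]) ? $argv[1] : 'concat';
$n = 200000;

if ($mode === 'concat') {
  $s = '';
  for ($i = 0; $i < $n; $i++) {
    $s .= $i . ',' . md5($i) . "\n";   // the string is regrown on every pass
  }
} else {
  $parts = array();
  for ($i = 0; $i < $n; $i++) {
    $parts[] = $i . ',' . md5($i);     // only small pieces until the end
  }
  $s = implode("\n", $parts) . "\n";   // one allocation of the final size
}

echo strlen($s) . ' bytes, peak ' . memory_get_peak_usage(true) . PHP_EOL;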

James Lyons