I have a script that, when put against a timer, gets progressively slower. It's fairly simple: all it does is read a line, check it, add it to the database, and then proceed to the next line.

Here's the output of it gradually getting worse:

Record: #1,001 Memory: 1,355,360kb taking 1.84s
Record: #2,002 Memory: 1,355,192kb taking 2.12s
Record: #3,003 Memory: 1,355,192kb taking 2.39s
Record: #4,004 Memory: 1,355,192kb taking 2.65s
Record: #5,005 Memory: 1,355,200kb taking 2.94s
Record: #6,006 Memory: 1,355,376kb taking 3.28s
Record: #7,007 Memory: 1,355,176kb taking 3.56s
Record: #8,008 Memory: 1,355,408kb taking 3.81s
Record: #9,009 Memory: 1,355,464kb taking 4.07s
Record: #10,010 Memory: 1,355,392kb taking 4.32s
Record: #11,011 Memory: 1,355,352kb taking 4.63s
Record: #12,012 Memory: 1,355,376kb taking 4.90s
Record: #13,013 Memory: 1,355,200kb taking 5.14s
Record: #14,014 Memory: 1,355,184kb taking 5.43s
Record: #15,015 Memory: 1,355,344kb taking 5.72s

The file, unfortunately, is around 20GB, so at this rate of increase I'll probably be dead by the time the whole thing is read. The code is (mainly) below; I suspect it's something to do with fgets(), but I'm not sure what.

    // Open the import file for reading
    $handle = fopen($import_file, 'r');

    // Read the file one line at a time until EOF
    while ($line = fgets($handle))
    {
        // Each line holds one JSON-encoded record
        $data = json_decode($line);

        save_record($data, $line);
    }

    fclose($handle);

Thanks in advance!

EDIT:

Commenting out save_record($data, $line); appears to make no difference.

A: 

http://php.net/manual/en/function.fgets.php

According to Leigh Purdie's comment on that manual page, there are performance issues with fgets() on big files. If your JSON objects are bigger than his test lines, you might hit the limits much faster.

Use stream_get_line() (http://php.net/manual/en/function.stream-get-line.php) instead, and specify a length limit.
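
For example, a minimal sketch of that approach, reusing $import_file and save_record() from the question (the 65536-byte cap is an arbitrary upper bound on line length):

    $handle = fopen($import_file, 'r');

    // Read up to 65536 bytes per call, stopping at the newline delimiter;
    // lines longer than the cap would come back in pieces
    while (($line = stream_get_line($handle, 65536, "\n")) !== false)
    {
        $data = json_decode($line);

        save_record($data, $line);
    }

    fclose($handle);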

Johan Buret
A: 

Alright, a performance problem. Obviously something is going quadratic when it shouldn't; or, more to the point, something that should be constant-time seems to be linear in the number of records dealt with so far. The first question is: what's the minimal scrap of code that exhibits the problem? I would want to know whether you get the same problematic behaviour when you comment out everything except reading the file line by line. If so, you'll need a language without that problem (there are plenty). Anyway, once you see the expected time characteristic, add statements back in one by one until your timing goes haywire, and you'll have identified the problem.

You instrumented something or other to get those timings; make sure the instrumentation itself can't be the cause by executing it alone 15,000 times or so.
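
For example, a bare-bones harness along those lines (a sketch only, reusing $import_file from the question; the 1,000-record reporting interval is arbitrary and the output format just mimics the log above):

    $handle = fopen($import_file, 'r');
    $count  = 0;
    $start  = microtime(true);

    // Nothing but the read loop: if this alone slows down over time,
    // the problem is in reading, not in save_record() or the database
    while ($line = fgets($handle))
    {
        $count++;

        if ($count % 1000 == 0)
        {
            printf("Record: #%s Memory: %skb taking %.2fs\n",
                number_format($count),
                number_format(memory_get_usage() / 1024),
                microtime(true) - $start);

            // Reset the timer so each report covers only the last 1,000 records
            $start = microtime(true);
        }
    }

    fclose($handle);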

Ian
+1  A: 

Sometimes it is better to use system commands for reading these large files. I ran into something similar and here is a little trick I used:

// wc -l prints "<count> <filename>"; cast to int to keep only the count
$lines = (int) exec("wc -l $filename");

for ($i = 1; $i <= $lines; $i++) {
    // sed 'N!d' deletes every line except line N, so this prints record $i only
    $line = exec('sed \''.$i.'!d\' '.$filename);

    // do what you want with the record here
}

I would not recommend this with files that cannot be trusted, but it runs fast since it pulls one record at a time using the system. Hope this helps.

cdburgess
+1 good idea, I'll consider this in the future.
alex