ansaurus

Question

PHP: What is an efficient way to parse a text file containing very long lines?

Answer 1

+2 A:

You could write a character-by-character accumulation loop that (a) pushes field strings onto an array when it encounters commas and (b) calls a function to save the accumulated field strings to a MySQL database when it finds the record separator:

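$fields = array();
$accumulator = array();
// Compare against false explicitly so a literal "0" in the data does not end the loop.
while(($c = fgetc($fp)) !== false) {
  if($c == ',') {
    // field separator: close the current field
    $fields[] = implode('',$accumulator);
    $accumulator = array();
  } else if($c == '\\') {
    // record separator: close the last field, then save the record
    $fields[] = implode('',$accumulator);
    save_fields_to_mysql($fields);
    $fields = array();
    $accumulator = array();
  } else
    $accumulator[] = $c;
}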

This will probably work for you if you're certain that your fields never contain your field or record separators as data.

If that's a possibility, you'll need to come up with an escape sequence to represent literal values of your field and record separator (and probably your escape sequence as well). Let's say that this is the case, and assume the % sign as an escape character:

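define('ESCAPED',1);
define('NORMAL',0);

$fields = array();
$accumulator = array();
$readState = NORMAL;
// Compare against false explicitly so a literal "0" in the data does not end the loop.
while(($c = fgetc($fp)) !== false) {
  if($readState == ESCAPED) {
    // previous character was %, so take this one as literal data
    $accumulator[] = $c;
    $readState = NORMAL;
  } else if($c == '%') {
    $readState = ESCAPED;
  } else if($c == ',') {
    $fields[] = implode('',$accumulator);
    $accumulator = array();
  } else if($c == '\\') {
    // close the last field before saving the record
    $fields[] = implode('',$accumulator);
    save_fields_to_mysql($fields);
    $fields = array();
    $accumulator = array();
  } else
    $accumulator[] = $c;
}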

That is, any occurrence of a % sets a state variable indicating that, on the next pass through the loop, whatever character we read is taken as literal field data rather than as a separator.

This should keep your memory usage at a minimum.
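To try the escape handling without a database, here is a minimal sketch of the same state machine that collects records into an array instead of saving them. The function name parse_escaped_records() and the sample data are made up for illustration, and the in-memory stream just stands in for a real file handle.

```php
<?php
// Hypothetical helper: same state machine as above, but it returns the
// parsed records instead of calling save_fields_to_mysql().
function parse_escaped_records($fp) {
    $records = array();
    $fields = array();
    $field = '';
    $escaped = false;
    while (($c = fgetc($fp)) !== false) {
        if ($escaped) {
            $field .= $c;          // literal character after %
            $escaped = false;
        } else if ($c == '%') {
            $escaped = true;       // next character is data, not a separator
        } else if ($c == ',') {
            $fields[] = $field;    // end of field
            $field = '';
        } else if ($c == '\\') {
            $fields[] = $field;    // end of last field and of the record
            $records[] = $fields;
            $fields = array();
            $field = '';
        } else {
            $field .= $c;
        }
    }
    return $records;
}

// Sample input (assumed): two records, with an escaped comma and an escaped %.
$fp = fopen('php://memory', 'w+');
fwrite($fp, 'a%,b,50%%\\x,y\\');
rewind($fp);
$records = parse_escaped_records($fp);
fclose($fp);
print_r($records);
```

The first record comes out with the fields "a,b" and "50%", the second with "x" and "y". Note that this sketch holds every record in memory at once, which is fine for a test but not what you want for the real file.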

[Update] What about I/O efficiency?

One commenter correctly pointed out that this illustration is pretty I/O intensive, and since I/O tends to be the most costly operation in terms of time, it's entirely possible it wouldn't be an acceptable solution.

At the other end of the spectrum we have the option of buffering the entire file into memory, which is exactly the kind of memory-intensive solution the Asker mentioned but wanted to avoid. The happy medium probably lies somewhere in the middle: we can use the read limit you can pass as the second argument to fgets() to pull in a somewhat large (but not ridiculously large) number of characters in a single I/O gulp, and then process that buffer character-by-character instead of the I/O stream, refilling it when we burn through the buffer. Keep in mind that fgets() returns at most one byte fewer than that limit and stops early if it hits a newline, so a gulp can be shorter than you asked for.

This does make the read process a little more code intensive than $c = fgetc($fp), though, because you have to monitor where you are in the buffer and how full the buffer is as well as where you are in the file. You can do this with a series of flags and index variables inside the read loop if you want, but it might be more convenient to have an abstraction something like this:

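class StrBufferedChrReader {

    private $_filename;
    private $_fp;

    private $_bufferIdx = 0;
    private $_bufferMax = 2048;
    private $_buffer = '';

    function __construct($filename=null,$bufferMax=null) {
        if($bufferMax) $this->_bufferMax = $bufferMax;
        if($filename) $this->open($filename);
    }

    // Pull the next chunk from the file; returns false at EOF or if no file is open.
    function _refillBuffer() {
        $this->_bufferIdx = 0;
        $this->_buffer = '';
        if($this->_fp) {
            // fgets() reads at most length - 1 bytes and also stops after a newline,
            // so a chunk may be shorter than $_bufferMax.
            $chunk = fgets($this->_fp,$this->_bufferMax + 1);
            if($chunk !== false) {
                $this->_buffer = $chunk;
                return true;
            }
        }
        return false;
    }

    function open($filename=null) {
        if($filename) $this->_filename = $filename;
        // fopen() requires a mode argument; 'r' opens for reading only.
        if($this->_fp = fopen($this->_filename,'r'))
            $this->_refillBuffer();
        return $this->_fp;
    }

    // Return the next character, or false once the file is exhausted.
    function getc() {
        // Compare against the actual chunk length, not $_bufferMax:
        // the last chunk (or one ending in a newline) can be shorter.
        if($this->_bufferIdx >= strlen($this->_buffer))
            if(!$this->_refillBuffer())
                return false;
        return $this->_buffer[$this->_bufferIdx++];
    }

    function close() {
        $this->_buffer = '';
        $this->_bufferIdx = 0;
        $fp = $this->_fp;
        $this->_fp = null;
        return $fp ? fclose($fp) : false;
    }
}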

Which you could use in either loop above like so:

$r = new StrBufferedChrReader($filename,$bufferSize);
while(($c = $r->getc()) !== false) {
    ...

Something like this allows you to stake out a lot of different spots along the continuum between a memory-intensive solution and an I/O intensive solution by changing $bufferSize. Bigger $bufferSize, more memory usage, fewer I/O ops. Smaller $bufferSize, less memory usage, more I/O ops.

(Note: don't assume that class is production-ready. It's meant as an illustration of a possible abstraction, may contain off-by-one or other errors. May cause blurred vision, lack of sleep, heart palpitations, or other side effects. Check with a doctor and unit testing before using.)
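If you are on PHP 5.5 or later, the same trade-off can also be sketched with a generator. This version swaps fgets() for fread(), which does not stop at newlines, so every chunk except the last is exactly the size you asked for. The names buffered_chars() and $chunkSize are made up for this illustration, and it walks the data byte-by-byte, so it assumes a single-byte encoding.

```php
<?php
// Hypothetical helper: yields one character (byte) at a time while reading
// the underlying stream in chunks of up to $chunkSize bytes.
function buffered_chars($fp, $chunkSize = 8192) {
    while (!feof($fp)) {
        $chunk = fread($fp, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break; // read error or nothing left
        }
        $len = strlen($chunk);
        for ($i = 0; $i < $len; $i++) {
            yield $chunk[$i];
        }
    }
}

// Sample input (assumed); the tiny chunk size of 3 forces several refills.
$fp = fopen('php://memory', 'w+');
fwrite($fp, 'ab,c0\\d');
rewind($fp);
$seen = '';
foreach (buffered_chars($fp, 3) as $c) {
    $seen .= $c;
}
fclose($fp);
echo $seen, "\n";
```

Because foreach drives the generator, there is no end-of-input comparison to get wrong: a "0" character is yielded like any other. Either parsing loop above could take its characters from foreach (buffered_chars($fp) as $c) instead of fgetc().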

Weston C 2010-04-01 06:15:36

While this certainly keeps the memory usage at the bare minimum, going character-by-character through a large file is an incredible amount of I/Os. Is there any easy way to buffer this to read bigger blocks?

Tony Trozzo 2010-04-01 16:10:19

Sure. Read the whole file into a $str and then use $str[$i] to look at it character-by-character. ;) Okay, obviously there has to be a happy medium between that memory-bound approach (which the Asker wanted to avoid) and the I/O-bound approach I illustrated. You can use fgets with a length-limit parameter (e.g., fgets($fp,2048)), which means you can pull a limited number of characters from the file even if you know it has very long lines. If it were me, I'd probably abstract this behind an object with a "next()" method... maybe I'll update my answer to show this.

Weston C 2010-04-01 19:24:53

Great update on your answer. Pretty sure you've covered all aspects of this question, and even threw in some humor for good measure!

Tony Trozzo 2010-04-02 05:35:40

Hmm, reading over this function gave me a headache and slight nausea. In all seriousness, I figured the solution was probably to write a simple Buffered File Reader myself, but I wanted to make sure php didn't already have something similar. I appreciate your response and the time you spent typing up that demo class.

Shaun 2010-04-02 05:40:56

Answer 2

A:

Maybe use the strtok() function?

$string = "Hello world. Beautiful day today.";
$token = strtok($string, " ");

// Compare with !== so that a token of "0" does not end the loop early.
while ($token !== false) {
    echo "$token\n";
    $token = strtok(" ");
}

SethCoder 2010-04-01 13:00:58
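strtok() works on a string that is already in memory. For a file, a rough stream-level analogue is PHP's built-in stream_get_line(), which reads up to a delimiter of your choosing. The sketch below assumes backslash-terminated records with comma-separated fields and no escaped separators in the data. The helper name read_records() and the $maxRecordLen bound are made up for this illustration, and the bound has to exceed the longest single record.

```php
<?php
// Hypothetical helper: reads backslash-terminated records from a stream and
// splits each one on commas. $maxRecordLen is an assumed upper bound on the
// length of one record, not of the whole file.
function read_records($fp, $maxRecordLen = 65536) {
    $records = array();
    while (($record = stream_get_line($fp, $maxRecordLen, '\\')) !== false) {
        if ($record === '') {
            continue; // assumption: empty records carry no data and are skipped
        }
        $records[] = explode(',', $record);
    }
    return $records;
}

// Sample input (assumed): two records on an in-memory stream.
$fp = fopen('php://memory', 'w+');
fwrite($fp, 'a,b\\c,d\\');
rewind($fp);
$records = read_records($fp);
fclose($fp);
print_r($records);
```

In real use you would hand each record to your save function inside the loop instead of collecting them, so only one record is in memory at a time. If fields can contain escaped separators, this shortcut no longer applies and a character-level state machine is the safer route.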
