I am reading some lines from a file in the following format:

Identifier String Number String Number String Number String Number
Identifier String Number String Number String Number
Identifier String Number String Number 
Identifier String Number String Number String Number String Number String Number

In the file I was given, I believe the lines are very long, so the following code:

<?php
        $fp = gzopen($filename, "r");
        while($source = gzgets($fp, 4096)) {
                $trans = array("\x0D" => "");
                $source = strtr($source,$trans);
                $source = trim($source);
                $source = explode(' ', $source);

                foreach($source as $value) {
                        $value = trim($value);

                        //Clean and insert into appropriate column
                }
        }
?>

produces parsing errors, i.e. I am not getting the expected column: where I expect a string I get a number, and where I expect a number I get an identifier. After hours of debugging, I figured out that a buffer size of 4096 is too small for the really long lines, so gzgets() returns only part of a line and the remainder arrives in the next iteration, which throws off the inner foreach loop. I tried giving a larger buffer value:

while($source = gzgets($fp, 409600)) {

but then my parsing still breaks in some other odd case. How can I handle this? Any suggestions?

+1  A: 

You can use gzgetc() to pull characters out of the file one at a time and check for line breaks manually. Once you have a full line, parse it as you normally would. But you don't say what the problem is with using a larger line size with gzgets(), so I can't say whether this will help.
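A minimal sketch of that idea, assuming the lines end in "\n" or "\r\n". The helper name readFullLine() is hypothetical and not part of zlib; only gzgetc() is a real library call here.

```php
<?php
// Hypothetical helper (not from the original post): reads one complete line
// from a gzip handle, however long it is, by pulling characters with gzgetc().
function readFullLine($fp)
{
    $line = '';
    while (($ch = gzgetc($fp)) !== false) {
        if ($ch === "\n") {
            return $line;          // end of line reached
        }
        if ($ch !== "\r") {        // drop carriage returns
            $line .= $ch;
        }
    }
    // End of file: return what remains, or false if nothing was read.
    return $line === '' ? false : $line;
}
```

It would be called as `while (($source = readFullLine($fp)) !== false) { ... }`, with the existing explode() logic inside the loop. Comparing against false explicitly also keeps a line such as "0" from ending the loop early.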

GrandmasterB
@GrandmasterB: +1 for gzgetc. Thank you. I implemented a simple FSM and solved the problem.
Legend
+2  A: 

Tasks of this type are simple to solve with an FSM (finite state machine). With an FSM you define several states, one of which is "the current char is \r\n", and then you are free to read the input in any way you like.
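One possible shape for such a state machine, as a rough sketch: the class name LineFsm, its methods, and the two states are assumptions made for illustration, and fields are assumed to be separated by spaces or tabs. Because the state lives in the object, a chunk boundary can fall anywhere, even in the middle of a field.

```php
<?php
// Hypothetical sketch (names are not from the original post).
// A small state machine that turns arbitrary chunks into rows of fields.
class LineFsm
{
    const BETWEEN_FIELDS = 0; // last char was a separator or line break
    const IN_FIELD = 1;       // currently accumulating a field

    private $state = self::BETWEEN_FIELDS;
    private $field = '';
    private $row = array();
    private $rows = array();

    // Feed any chunk; state survives across calls, so chunk boundaries
    // may fall anywhere, including in the middle of a field.
    public function feed($chunk)
    {
        $len = strlen($chunk);
        for ($i = 0; $i < $len; $i++) {
            $ch = $chunk[$i];
            if ($ch === "\n" || $ch === "\r") {
                $this->endField();
                $this->endRow();
            } elseif ($ch === ' ' || $ch === "\t") {
                $this->endField();
            } else {
                $this->field .= $ch;
                $this->state = self::IN_FIELD;
            }
        }
    }

    // Call once at end of input to flush a last line with no trailing newline.
    public function finish()
    {
        $this->endField();
        $this->endRow();
        return $this->rows;
    }

    private function endField()
    {
        if ($this->state === self::IN_FIELD) {
            $this->row[] = $this->field;
            $this->field = '';
            $this->state = self::BETWEEN_FIELDS;
        }
    }

    // Blank lines (and the "\n" half of "\r\n") produce no row.
    private function endRow()
    {
        if (count($this->row) > 0) {
            $this->rows[] = $this->row;
            $this->row = array();
        }
    }
}
```

With a gzip handle, each result of `gzread($fp, 4096)` could be passed to `feed()` until `gzeof($fp)` is true, followed by one call to `finish()`. This sketch keeps every row in memory; for a very large file, endRow() could instead hand each row to the insert logic as soon as it is complete.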

zerkms
@zerkms: +1 and thank you. I guess I must be going crazy for missing that FSM point :)
Legend