ansaurus

Question

How do I efficiently parse a CSV file in Perl?

Answer 1

A:

You can do it in one pass if you read the file line by line. There is no need to read the whole thing into memory at once.

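# Minimal error handling: die if the file cannot be opened.
open(my $fh, '<', $filename) or die "Can't open '$filename': $!";
while (my $line = <$fh>) {
     chomp $line;
     my @csv = split /,/, $line;

     # now process the fields in @csv however you want.

}
close $fh;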

I'm not really sure this is significantly more efficient, though; Perl is pretty fast at string processing.

YOU NEED TO BENCHMARK YOUR IMPORT to see what is causing the slowdown. If, for example, you are doing a db insertion that takes 85% of the time, optimizing the parsing won't help much.
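One rough way to see where the time goes is to time the parsing and the per-row work separately. The sketch below uses the core Time::HiRes module. Note that process_row is a hypothetical stand-in for whatever your import does with each row (for example the db insertion), and the sample data is made up for illustration.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical stand-in for the real per-row work (e.g. a db insertion).
sub process_row {
    my ($fields) = @_;
    return scalar @$fields;
}

# Made-up sample data, read through an in-memory filehandle.
my $sample = join '', map { "id$_,name$_,$_\n" } 1 .. 1000;
open(my $fh, '<', \$sample) or die "Can't open in-memory handle: $!";

my ($parse_time, $process_time, $rows) = (0, 0, 0);
while (my $line = <$fh>) {
    my $t0 = time;
    chomp $line;
    my @fields = split /,/, $line;
    my $t1 = time;
    process_row(\@fields);
    my $t2 = time;

    # Accumulate the time spent in each phase.
    $parse_time   += $t1 - $t0;
    $process_time += $t2 - $t1;
    $rows++;
}
close $fh;

printf "rows: %d, parsing: %.4fs, processing: %.4fs\n",
    $rows, $parse_time, $process_time;
```

If the processing figure dwarfs the parsing figure, a faster parser will not change the total much.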

Edit

Although this feels like code golf, the general algorithm is to read the whole file, or part of the file, into a buffer.

Iterate byte by byte through the buffer until you find a CSV delimiter or a newline.

When you find a delimiter, increment your column count.
When you find a newline increment your row count.
If you hit the end of your buffer, read more data from the file and repeat.

That's it. But reading a large file into memory is really not the best way; see my original answer for the normal way this is done.
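The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: there are no quoted fields, every row ends with a newline, and the function name count_csv and the tiny buffer size are made up for the example. It counts fields rather than delimiters, so the newline also closes a field.

```perl
use strict;
use warnings;

# Count rows and fields by scanning fixed-size buffers byte by byte.
# Assumes no quoted fields and a newline at the end of every row.
sub count_csv {
    my ($fh, $buf_size) = @_;
    my ($rows, $fields) = (0, 0);
    my $buffer;

    # read() returns the number of bytes read, and 0 at end of file.
    while (my $got = read($fh, $buffer, $buf_size)) {
        for my $i (0 .. $got - 1) {
            my $char = substr($buffer, $i, 1);
            if ($char eq ',') {
                $fields++;            # a delimiter ends one field
            }
            elsif ($char eq "\n") {
                $fields++;            # the newline ends the last field of the row
                $rows++;
            }
        }
    }
    return ($rows, $fields);
}

my $sample = "a,b,c\nd,e,f\n";
open(my $fh, '<', \$sample) or die "Can't open in-memory handle: $!";
my ($rows, $fields) = count_csv($fh, 4);   # tiny buffer to force refills
close $fh;
print "rows: $rows, fields: $fields\n";    # rows: 2, fields: 6
```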

Byron Whitlock 2010-06-17 19:56:26

thanks for the response. please see edits

Mike 2010-06-17 20:04:05

Since Perl 5.8, when the file is in memory (say, in a variable called `$scalar`), you can still use the filehandle iterator on it with `open(FILE,"<",\$scalar)`.

mobrule 2010-06-17 21:43:00

Answer 2

+9 A:

The right way to do it -- by an order of magnitude -- is to use Text::CSV_XS. It will be much faster and much more robust than anything you're likely to do on your own. If you're determined to use only core functionality, you have a couple of options depending on speed vs robustness.

About the fastest you'll get for pure-Perl is to read the file line by line and then naively split the data:

my $file = 'somefile.csv';
my @data;
open(my $fh, '<', $file) or die "Can't read file '$file' [$!]\n";
while (my $line = <$fh>) {
    chomp $line;
    my @fields = split(/,/, $line);
    push @data, \@fields;
}

This will fail if any fields contain embedded commas. A more robust (but slower) approach is to use Text::ParseWords, which is a core module. To do that, load it with use Text::ParseWords; and replace the split with this:

    my @fields = Text::ParseWords::parse_line(',', 0, $line);
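To illustrate the difference, here is a small self-contained comparison of the two approaches on a line with a quoted, comma-containing field. The sample line is made up for the example.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

my $line = 'foo,bar,"baz,quux",123';

# Naive split breaks the quoted field in two.
my @naive = split /,/, $line;

# parse_line honors the quotes; the 0 tells it to drop them from the result.
my @quoted = parse_line(',', 0, $line);

print scalar(@naive), " fields from split\n";        # 5
print scalar(@quoted), " fields from parse_line\n";  # 4
print "third field: $quoted[2]\n";                   # baz,quux
```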

Michael Carman 2010-06-17 20:07:51

Answer 3

+4 A:

As other people mentioned, the correct way to do this is with Text::CSV, and either the Text::CSV_XS back end (for FASTEST reading) or Text::CSV_PP back end (if you can't compile the XS module).

If you're allowed to keep extra code locally (e.g., your own personal modules), you could take Text::CSV_PP, put it somewhere local, and then access it via the use lib workaround:

use lib '/path/to/my/perllib';
use Text::CSV_PP;

Additionally, if there's no alternative to having the entire file read into memory and (I assume) stored in a scalar, you can still read it like a file handle, by opening a handle to the scalar:

my $data = stupid_required_interface_that_reads_the_entire_giant_file();

open my $text_handle, '<', \$data
   or die "Failed to open the handle: $!";

And then read via the Text::CSV interface:

use Text::CSV;

my $csv = Text::CSV->new ( { binary => 1 } )
             or die "Cannot use CSV: ".Text::CSV->error_diag ();
while (my $row = $csv->getline($text_handle)) {
    ...
}

or the sub-optimal split on commas:

while (my $line = <$text_handle>) {
    my @csv = split /,/, $line;
    ... # regular work as before.
}

With this method, the data is only copied a bit at a time out of the scalar.

Robert P 2010-06-17 21:00:31

And the second most correct way to do this is to create the `Mike::Text::CSV` module, copy the source code from `Text::CSV` into it, and add a disclaimer about how it was "inspired" by the open source Text::CSV module.

mobrule 2010-06-17 21:40:37

I like it! I like it very much.

Robert P 2010-06-17 21:58:27

Answer 4

+1 A:

Assuming that you have your CSV file loaded into the $csv variable and that you do not need the text in this variable after you have successfully parsed it:

my $result=[[]];
while($csv=~s/(.*?)([,\n]|$)//s) {
    push @{$result->[-1]}, $1;
    push @$result, [] if $2 eq "\n";
    last unless $2;
}

If you need to have $csv untouched:

local $_;
my $result = [[]];
foreach ($csv =~ /(?:(?<=[,\n])|^)(.*?)(?:,|(\n)|$)/gs) {
    next unless defined $_;
    if ($_ eq "\n") {
        push @$result, [];
    }
    else {
        push @{$result->[-1]}, $_;
    }
}

ZyX 2010-06-17 21:12:07

Other than padding your lines of code count, in what way is this better than `split`?

mobrule 2010-06-17 21:37:34

@mobrule If you use `split`, you need to use it twice, so the data will be read twice; my solution reads the data only once. // But this is true only if the data is already loaded.

ZyX 2010-06-18 11:03:49

Answer 5

A:

Answering within the constraints imposed by the question, you can still cut out the first split by slurping your input file into an array rather than a scalar:

open(my $fh, '<', $input_file_path) or die;
my @all_lines = <$fh>;
for my $line (@all_lines) {
  chomp $line;
  my @fields = split ',', $line;
  process_fields(@fields);
}

And even if you can't install (the pure-Perl version of) Text::CSV, you may be able to get away with pulling up its source code on CPAN and copy/pasting the code into your project...

Dave Sherohman 2010-06-18 09:56:38

Answer 6

+1 A:

Here is a version that also respects quotes (e.g. foo,bar,"baz,quux",123 -> "foo", "bar", "baz,quux", "123").

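sub csvsplit {
        my $line = shift;
        my $sep = (shift or ',');

        # Skip undefined or empty input (a plain truth test would also drop "0").
        return () unless defined $line && length $line;

        my @cells;
        $line =~ s/\r?\n$//;

        # \Q...\E quotes the separator so characters such as | or . are literal.
        my $re = qr/(?:^|\Q$sep\E)(?:"([^"]*)"|([^\Q$sep\E]*))/;

        while($line =~ /$re/g) {
                # $1 is set for a quoted cell, $2 for an unquoted one.
                my $value = defined $1 ? $1 : $2;
                push @cells, (defined $value ? $value : '');
        }

        return @cells;
}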

Use it like this:

while(my $line = <FILE>) {
    my @cells = csvsplit($line); # or csvsplit($line, $my_custom_separator)
}

jkramer 2010-06-18 10:22:25
