views: 511
answers: 5

Requirements:

a) I have a very large CSV file to read (about 3 GB).

b) I won't need all of the records; there are some conditions we can use to filter them, for example, keeping a row only if its 3rd CSV column contains 'XXXX' and its 4th column contains '999'. Can I use these conditions to improve the read process? If so, how can I do that using Perl?

Please show an example (a Perl script) in your answer.

Thanks in advance.

+4  A: 

Use Text::CSV

Maxwell Troy Milton King
for a really big file like this you should be using Text::CSV_XS
singingfish
it will use ::_XS if it is present on your system.
Evan Carroll
in other words: XS modules typically provide better memory and/or CPU performance than pure Perl modules, which would be helpful with large files such as the one you described. See http://en.wikipedia.org/wiki/XS_%28Perl%29
molecules
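
To make the Text::CSV suggestion concrete, here is a minimal sketch. The file name and the column-3/column-4 checks are assumptions taken from the question; Text::CSV picks up the XS backend automatically when it is installed.

#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;   # delegates to Text::CSV_XS if it is installed

# Assumed input file; the 'XXXX'/'999' tests come from the question.
my $csv = Text::CSV->new({ binary => 1 }) or die Text::CSV->error_diag;
open my $fh, '<', 'data.csv' or die "Cannot open data.csv: $!";
while (my $row = $csv->getline($fh)) {
    # fields are zero-indexed: [2] is the 3rd column, [3] is the 4th
    next unless $row->[2] eq 'XXXX' && $row->[3] eq '999';
    # process the matching record here
}
$csv->eof or $csv->error_diag;
close $fh;
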
+2  A: 

Use a module like Text::CSV. However, if you know that your data will not have embedded commas and is in a simple CSV format, then a plain while loop that iterates over the file will suffice:

while (<>) {
  chomp;
  my @s = split /,/;   # naive split; safe only for CSV without quoted commas
  if ( $s[2] eq "XXXX" && $s[3] eq "999" ) {
    # do something with the matching record;
  }
}
ghostdog74
+12  A: 

Here's a solution:

#!/usr/bin/env perl
use warnings;
use strict;
use Text::CSV_XS;
use autodie;
my $csv = Text::CSV_XS->new();
open my $FH, "<", "file.txt";      # autodie throws an exception if the open fails
while (<$FH>) {
    $csv->parse($_) or next;       # skip lines that fail to parse
    my @fields = $csv->fields;
    next unless $fields[1] =~ /something I want/;
    # do the stuff to the fields you want here
}
singingfish
You're missing part of the dereference operator in the call to parse, and the regex is malformed, but other than that it's a great example.
Evan Carroll
+2  A: 

The Text::CSV module is a great solution for this. Another option is the DBD::CSV module, which provides a slightly different interface. The DBI interface is really useful if you're developing applications that have to access data from different forms of databases, including relational databases and comma-separated text files.

Here's some example code:

#!/usr/bin/perl

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect ("DBI:CSV:f_dir=/home/joe/csvdb")
    or die "Cannot connect: $DBI::errstr";

my $sth = $dbh->prepare ("SELECT id, name FROM info.txt WHERE id > 1 ORDER BY id");
$sth->execute;

my ($id, $name);
$sth->bind_columns (\$id, \$name);
while ($sth->fetch) {
    print "Found result row: id = $id, name = $name\n";
}
$sth->finish;

I'd use Text::CSV for this task unless you're planning on talking to other types of databases, but in Perl TIMTOWTDI and it helps to know your options.

James Thompson
+5  A: 

Your a) question has been answered a few times over already, but b) has not yet been addressed:

I won't need all records, I mean, there are some conditionals that we can use, for example, if the 3rd CSV column content has 'XXXX' and 4th column has '999'. Can I use these conditionals to improve the read process?

No. How would you know whether the 3rd CSV column contains 'XXXX' or the 4th is '999' without reading the line first? (DBD::CSV lets you hide this behind an SQL WHERE clause, but, because CSV is unindexed data, it still needs to read in every line to determine which lines match the condition(s) and which don't.)

Pretty much the only way the content of a line could be used to let you skip reading parts of the file is if it contained information telling you 1) "skip the section following this line" and 2) "continue reading at byte offset nnn".
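
The most you can do is keep the per-line work cheap, for example by rejecting lines with a quick substring check before paying for a full CSV parse. A rough sketch (the file name and the 'XXXX'/'999' values are assumptions taken from the question):

#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1 });

open my $fh, '<', 'data.csv' or die "Cannot open data.csv: $!";
while (my $line = <$fh>) {
    # cheap pre-filter: every line is still read, but lines that cannot
    # possibly match are skipped before the more expensive CSV parse
    next if index($line, 'XXXX') < 0 || index($line, '999') < 0;

    $csv->parse($line) or next;
    my @fields = $csv->fields;
    next unless $fields[2] eq 'XXXX' && $fields[3] eq '999';
    # process the matching record here
}
close $fh;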

Dave Sherohman
Yeah, that's true. Thanks.
André Diniz