ansaurus

Question

Perl Text::CSV - filtering specific columns in a CSV document and discarding others

Answer 1

+2 A:

WHY are you trying to do this? Is it to minimize storage? Eliminate processing costs for parsing many un-needed columns?

If the latter, you can't avoid that processing cost. Any solution you come up with would STILL read and parse 100% of the file.

If the former, there are many methods, some more efficient than others.

Also, what exactly do you mean by "help me do a quick check of the column names"? If you want to get the column names, there is the column_names() method, provided you previously set the names with $csv->column_names($csv->getline($fh)).

If you want to return only specific columns in a hash, to avoid wasting memory on un-needed columns, there is no clear-cut API for that. You can roll your own, or abuse a "bug/feature" of the getline_hr() method.
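As a minimal sketch of that set-then-read pattern: the in-memory sample data and its values below are made up for illustration, and only the column names come from this thread.

```perl
use strict;
use warnings;
use Text::CSV;

# Hypothetical sample data held in memory so the sketch needs no file on disk.
my $data = "date,mem_total,cpu.usagemhz.average_0\n"
         . "2010-10-26,4096,1200\n";
open my $fh, '<', \$data or die "Cannot open in-memory data: $!";

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot create Text::CSV object";

# Read the header row and register it as the column names.
# column_names() accepts the array ref returned by getline().
$csv->column_names($csv->getline($fh));

# Called without arguments, column_names() returns what was set above.
my @names = $csv->column_names;

# Each following row can now be read as a hash ref keyed by column name.
my $row = $csv->getline_hr($fh);
print "Columns: @names\n";
print "mem_total = $row->{mem_total}\n";
close $fh;
```

Calling column_names() before any names have been set gives you nothing useful back, which is why the header row has to be read and registered first.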

For the former (roll your own), you can do something like:

my $headers = $csv->getline( $fh ); # First line is headers.
# 1 for each column to keep, 0 otherwise; adjust the pattern to your header names.
my @headers_keep = map { /^cpu.usage.mhz.average/ ? 1 : 0 } @$headers;
my @rows;
while ( my $row = $csv->getline( $fh ) ) {
    my $i = 0;
    my @row_new = grep { $headers_keep[$i++] } @$row; # keep only flagged columns
    push @rows, \@row_new;
}

You can also use a "feature" of getline_hr(): when a column name is duplicated, only the LAST column with that name has its value assigned into the hash.

In your case, for the column names date,mem_total,cpu.usagemhz.average_0,cpu.usagemhz.average_1,cpu.usagemhz.average_2, merely set the first 2 elements of the column_names array to "cpu.usagemhz.average_0" as well. The values of those two columns will then NOT be saved by getline_hr().

You can go over the list of columns, find each consecutive range of "not needed" columns, and replace their names with the name of the first needed column following that range. The only sticking point is an "un-needed" range at the very end of the columns: replace those names with "JUNK" or something similar, and ignore that key.
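The renaming described above could be sketched roughly as follows. The sample data, the trailing "notes" column, and the exact pattern are assumptions made for illustration. Walking the header list from right to left is just one convenient way to find "the next needed column".

```perl
use strict;
use warnings;
use Text::CSV;

# Hypothetical sample data; "notes" is an invented trailing un-needed column.
my $data = "date,mem_total,cpu.usagemhz.average_0,cpu.usagemhz.average_1,notes\n"
         . "2010-10-26,4096,1200,1300,ok\n";
open my $fh, '<', \$data or die "Cannot open in-memory data: $!";

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot create Text::CSV object";
my $headers = $csv->getline($fh);

# Walk right to left: every un-needed column takes the name of the next
# needed column to its right, or "JUNK" if there is none.
my @names = @$headers;
my $next  = 'JUNK';
for my $i (reverse 0 .. $#names) {
    if ($names[$i] =~ /^cpu\.usagemhz\.average/) {   # assumed pattern
        $next = $names[$i];
    }
    else {
        $names[$i] = $next;
    }
}
$csv->column_names(@names);

my @rows;
while (my $row = $csv->getline_hr($fh)) {
    delete $row->{JUNK};   # drop the placeholder key, if any
    push @rows, $row;      # only the cpu columns remain
}
close $fh;
```

This relies on column_names() not rejecting duplicate names and on the last duplicate winning in the hash, so it is worth re-checking against the version of Text::CSV you have installed.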

DVK 2010-10-26 18:17:28

`column_names` is actually for setting the column names (for use with `getline_hr`). It does return the column names, but only if you previously called it to set them. Text::CSV has no support for automatically interpreting the first row as column names; you have to do that manually.

cjm 2010-10-26 18:38:26

@cjm - of course. Sorry I wasn't clearer.

DVK 2010-10-26 18:48:43

Thanks for the response DVK!

James D 2010-10-26 21:39:15

The reason I need to filter out some columns is because I am running a script that will open up thousands of csv files in different directories. After opening the csv files I will need to run a calculation on the cpu columns of each file. The position of the cpu columns will change depending on the specific configuration of the system in question. For example some systems may have 2 processors and others may have 4 or 8, etc. Thanks Again!

James D 2010-10-26 21:42:15

Answer 2

+1 A:

Since your fields of interest are at index 2-4, we'll just pluck those out of the field array returned by getline(). This sample code prints them but you can do whatever you like to them.

use Text::CSV;                                        # load the module
my $csv = Text::CSV->new ();                          # instantiate
open my $fh, "<", "somefile" or die "somefile: $!";   # open the input, or stop with the reason
while ( my $fields = $csv->getline($fh) ) {           # read a line, and parse it into fields
    print "I got @{$fields}[2..4]\n";                 # print the fields of interest
}
close ($fh);                                          # close when done

Len Jaffe 2010-10-26 18:28:09

Answer 3

+1 A:

No, not a specific function in Text::CSV. I would do something like this:

use strict;
use warnings;
use Text::CSV;

my $file = "foo.csv";
my $pattern = "cpu.usage.mhz.average.*";   # regex matched against each heading
open(my $fh, "<", $file) or die "Unable to open $file: $!\n";

my $csv = Text::CSV->new();   # one parser object is enough for the whole file
my $lineCount = 0;
my %desiredColumns;
my %columnContents;

while(<$fh>) {
  $lineCount++;
  my $status = $csv->parse($_); # should really check this!
  my @fields = $csv->fields();
  my $colCount = 0;

  if ($lineCount == 1) {
    # Let's look at the column headings.
    foreach my $field (@fields) {
      $colCount++;
      if ($field =~ m/$pattern/) {
        # This heading matches, save the column #.
        $desiredColumns{$colCount} = 1;
      }
    }
  }
  else {
    # Not the header row.  Parse the body of the file.
    foreach my $field (@fields) {
      $colCount++;
      if (exists $desiredColumns{$colCount}) {
        # This is one of the desired columns.
        # Do whatever you want to do with this column!
        push(@{$columnContents{$colCount}}, $field);
      }
    }
  }
}
close($fh);

# Sort numerically so that column 10 comes after column 9, not after column 1.
foreach my $key (sort { $a <=> $b } keys %columnContents) {
  print "Column $key: " . join(",", @{$columnContents{$key}}) . "\n\n";
}

Hope that helps! I'm sure someone can write that in a Perl one-liner, but that's easier (for me) to read...

jimtut 2010-10-26 18:40:46

Thanks jimtut! What did you mean by "my $status = $csv->parse($_); # should really check this!"?

James D 2010-10-26 22:00:09

Look at the docs for Text::CSV. The $status is usually used to detect that the Text::CSV operation succeeded (or failed). If you don't check it, you may start trying to operate on the @fields when they haven't been properly populated.
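A rough sketch of such a check, assuming you want to skip bad lines rather than abort; the two sample lines are invented, and the second one is deliberately malformed:

```perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot create Text::CSV object";

# Hypothetical input: the second line has a quoted field that is never closed.
my @lines = ('a,b,c', 'x,"unterminated,z');

my (@good, @bad);
for my $line (@lines) {
    if ($csv->parse($line)) {
        # parse() succeeded, so fields() is safe to use.
        push @good, [ $csv->fields ];
    }
    else {
        # parse() failed; error_diag() reports the error code and message.
        my ($code, $message) = $csv->error_diag;
        warn "Skipping bad line ($code: $message): $line\n";
        push @bad, $line;
    }
}
```

Whether to skip, warn, or die on a bad line depends on how much you trust the input files.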

jimtut 2010-10-27 12:00:28
