ansaurus

Question

Matching lines across multiple csv files and merging a particular field

Answer 1

A:

I would approach this by doing something along the lines of:

cut -d ',' -f1,16 *.csv | 
    sort |
    awk -F, '{d=""; if (array[$1]) d=","; array[$1] = array[$1] d $2} END { for (i in array) print i "," array[i]}' |
    while IFS="," read -r email start; do sed -i "/^$email,/ s/,[^,]*\$/,$start/" *.csv; done

This creates a list of all the emails (cut/sort) and start_codes and consolidates (awk) them. Then it replaces (sed) the start_code for each matching email in each file (while).
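To see what the consolidation step produces on its own, here is a minimal sketch run on made-up data. The sample rows and the two-column layout are assumptions for illustration (the pipeline above reads the code from field 16). The `in` test is a small substitution for the truthiness check, which would skip the separator if the first code seen were empty. The trailing sort is only there because awk's for-in order is unspecified.

```shell
# Hypothetical sample: email in column 1, start_code in column 2
# (the real data is assumed to carry the code in column 16).
sample='bob@example.com,B7
alice@example.com,A1
alice@example.com,A2'

merged=$(printf '%s\n' "$sample" |
    sort |
    awk -F, '{ d = ""; if ($1 in codes) d = ","; codes[$1] = codes[$1] d $2 }
             END { for (e in codes) print e "," codes[e] }' |
    sort)   # awk does not guarantee for-in order, so sort for stable output

printf '%s\n' "$merged"
# alice@example.com,A1,A2
# bob@example.com,B7
```

Each email ends up on one line followed by every code seen for it, which is the list the while/sed loop then consumes.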

But I feel like there must be a more efficient way.

Dennis Williamson 2010-07-29 18:08:59

I renamed all the files to begin with lower-case characters, as anything with an upper-case character gave this error:

sed: 1: "R2R.csv": invalid command code R

I am now getting this error:

sed: 1: "bwtl.csv": undefined label 'wtl.csv'

I think this results from the same underlying problem: sed is taking the filename as a command.

alex 2010-07-29 19:22:51

@alex: Double-check that you're not missing the space before the asterisk and that you don't have any misplaced quotation marks. Are you on a GNU-based (e.g. Linux) system? Do your files have slashes in the data? You might try changing the delimiter in the `sed` command to pipes (`'s|old|new|'`) or some other character that's not in your data.

Dennis Williamson 2010-07-29 19:41:55
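For what it's worth, the error messages above are consistent with BSD sed (as shipped on macOS), where -i expects a backup suffix as its argument, so the script is consumed as the suffix and the first filename is parsed as the script. Below is a minimal sketch of a form that should work with both GNU and BSD sed, attaching the suffix directly to -i. The scratch directory, demo.csv, and the sample values are made up for illustration.

```shell
# Hypothetical demo file in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'alice@example.com,x,A1' 'bob@example.com,y,B7' > "$workdir/demo.csv"

email='alice@example.com'   # assumed sample key
start='A1 A2'               # assumed merged value

# -i.bak (suffix attached, no space) is accepted by both GNU and BSD sed:
# it edits the file in place and leaves demo.csv.bak behind as a backup.
sed -i.bak "/^$email,/ s/,[^,]*\$/,$start/" "$workdir/demo.csv"

result=$(cat "$workdir/demo.csv")
printf '%s\n' "$result"
# alice@example.com,x,A1 A2
# bob@example.com,y,B7
```

The .bak files can be deleted once the result has been checked.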

Answer 2

+1 A:

use strict;
use warnings;
use Text::CSV_XS;

# Supply csv files as command line arguments.
my @csv_files = @ARGV;
my $parser    = Text::CSV_XS->new;

# In my test data, the email is the first field. The field
# to be merged is the second. Adjust accordingly.
my $EMAIL_i   = 0;
my $MERGE_i   = 1;

# Process all files, creating a set of key-value pairs:
#    $sc{EMAIL} = [ LIST OF VALUES OBSERVED IN THE MERGE FIELD ]
my %sc;
for my $cf (@csv_files){
    open(my $fh_in, '<', $cf) or die $!;

    while (my $line = <$fh_in>){
        die "Failed parse : $cf : $.\n" unless $parser->parse($line);
        my @fields = $parser->fields;
        push @{ $sc{$fields[$EMAIL_i]} }, $fields[$MERGE_i];
    }
}

# Process the files again, writing new output.
for my $cf (@csv_files){
    open(my $fh_in,  '<', $cf)             or die $!;
    open(my $fh_out, '>', "${cf}_new.csv") or die $!;

    while (my $line = <$fh_in>){
        die "Failed parse : $cf : $.\n" unless $parser->parse($line);
        my @fields = $parser->fields;

        $fields[$MERGE_i] = join ', ', @{ $sc{$fields[$EMAIL_i]} };

        $parser->print($fh_out, \@fields);
        print $fh_out "\n";
    }
}

FM 2010-07-29 22:56:04

This worked quite well! I had to throw in `binmode $fh_in, ":utf8";` and manually clean up some blank lines from each file (`:g/^$/d`), but this worked. Thanks.

alex 2010-08-03 16:38:35

Answer 3

A:

Here's a simple Perl program achieving what you need. It does a single pass on your input by relying on the fact that it is sorted beforehand.

It reads lines and appends the code as long as the email does not change. When the email changes, it prints the record (and fixes extra double quotes in the code field).

#!/usr/bin/perl -l

use strict;
use warnings;

my $last_email = undef;
my @current_record = ();
my @fields = ();

sub print_record {
  # Remove repeated double quotes introduced when we appended the code
  $current_record[15] =~ s/""/, /g;
  print join ",", @current_record;
  @current_record = ();
} 

while (my $input_line = <>) {
  chomp $input_line;
  @fields = split ",", $input_line;

  # Print a record when the email we read changes. Avoid printing on the first
  # loop by checking we have read at least one email ($last_email is defined).
  defined $last_email && ($fields[0] ne $last_email) && print_record;

  if (!@current_record)  {
    # We are starting to process a new email. Grab all fields.
    @current_record = @fields;
  }
  else {
    # We have consecutive records with the same email. Append the code.
    $current_record[15] .= $fields[15];
  }

  # Remember the last processed email. When it changes we will print @current_record.
  $last_email = $fields[0];
}

# Print the last record
print_record

The -l switch makes print append a newline automatically (whatever the OS is).

Call it like this:

sort *.csv | ./script.pl

Philippe A. 2010-07-30 02:01:50
