I have about 20 CSVs that all look like this:

"[email]","[fname]","[lname]","[prefix]","[suffix]","[fax]","[phone]","[business]","[address1]","[address2]","[city]","[state]","[zip]","[setdate]","[email_type]","[start_code]"

What I've been told I need to produce is the exact same thing, but with each file now containing the start_code from every other file where the email matches.

It doesn't matter if any of the other fields don't match, just the email field is important, and the only change to each file would be to add any other start_code values from other files where the email matches.

For example, if the same email appeared in the wicq.csv, oota.csv, and itos.csv it would go from being the following in each file:

"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"WIQC PDX"
"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"OOTA"
"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"ITOS"

to

"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"WIQC PDX, OOTA, ITOS"

for all three files (wicq.csv, oota.csv, and itos.csv)

The tools I have available are the OS X command line (awk, sed, etc.) and Perl, though I'm not too familiar with either, and there may be a better way to do this.

A: 

I would approach this by doing something along the lines of:

cut -d ',' -f1,16 *.csv | 
    sort |
    awk -F, '{d=""; if (array[$1]) d=","; array[$1] = array[$1] d $2} END { for (i in array) print i "," array[i]}' |
    while IFS="," read -r email start; do sed -i "/^$email,/ s/,[^,]*\$/,$start/" *.csv; done

This creates a list of all the emails (cut/sort) and start_codes and consolidates (awk) them. Then it replaces (sed) the start_code for each matching email in each file (while).
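
To see what the consolidation step hands to the `while` loop, here is a minimal sketch that runs just the sort and awk stages on three made-up email/start_code pairs. The addresses and the `consolidate` function name are placeholders for illustration, not part of the pipeline above.

```shell
# consolidate: hypothetical wrapper around the sort/awk stages above.
# Reads "email,start_code" lines on stdin, prints one line per email.
consolidate() {
  sort |
    awk -F, '{d=""; if (array[$1]) d=","; array[$1] = array[$1] d $2}
             END { for (i in array) print i "," array[i] }' |
    sort   # awk's for-in order is unspecified, so sort for stable output
}

# Made-up sample data in the shape that cut -f1,16 would emit.
printf '%s\n' \
  '"a@example.com","WIQC PDX"' \
  '"a@example.com","OOTA"' \
  '"b@example.com","ITOS"' | consolidate
```

Each code keeps its own quotes, so the merged value arrives as `"OOTA","WIQC PDX"` rather than as a single quoted field. If one combined field is required, the inner quotes would need to be stripped and re-added before the sed step.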

But I feel like there must be a more efficient way.

Dennis Williamson
I renamed all the files to begin with lower case characters, because anything starting with an upper case character gave this error: `sed: 1: "R2R.csv": invalid command code R`. I am now getting this error: `sed: 1: "bwtl.csv": undefined label 'wtl.csv'`. I think it results from the same initial problem: sed is taking the filename as a command.
alex
@alex: double check that you're not missing the space before the asterisk and that you don't have any misplaced quotation marks. Are you on a GNU-based (e.g. Linux) system? Do your files have slashes in the data? You might try changing the delimiter in the `sed` command to pipes (`'s|old|new|'`) or some other character that's not in your data.
Dennis Williamson
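
Both error messages quoted above are what BSD sed (the one shipped with OS X) prints when `-i` is given without a suffix: BSD sed treats the next argument as the backup suffix, so the intended script is swallowed and the first filename is parsed as the script (`R` is not a command there, and `b` is a branch to a label). Below is a minimal sketch of a form that both GNU and BSD sed accept, with the suffix attached directly to `-i`. The names `replace_last_field` and `demo.csv` are hypothetical, and the sketch assumes the email and the new value contain no `|`, `&`, or other characters that are special to sed.

```shell
# replace_last_field FILE EMAIL NEW: hypothetical helper.
# On lines that start with EMAIL followed by a comma, replace the last
# comma-separated field with NEW. A backup copy is kept in FILE.bak.
replace_last_field() {
  sed -i.bak "/^$2,/ s|,[^,]*\$|,$3|" "$1"
}

# Demo on a scratch file with made-up data.
workdir=$(mktemp -d)
printf '%s\n' '"a@example.com","anon","OOTA"' > "$workdir/demo.csv"
replace_last_field "$workdir/demo.csv" '"a@example.com"' '"WIQC PDX, OOTA"'
cat "$workdir/demo.csv"
```

The `.` characters in the email are still treated as regex wildcards here, so the match is slightly looser than an exact string comparison.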
A: 
use strict;
use warnings;
use Text::CSV_XS;

# Supply csv files as command line arguments.
my @csv_files = @ARGV;
my $parser    = Text::CSV_XS->new;

# In my test data, the email is the first field. The field
# to be merged is the second. Adjust accordingly.
my $EMAIL_i   = 0;
my $MERGE_i   = 1;

# Process all files, creating a set of key-value pairs:
#    $sc{EMAIL} = [ LIST OF VALUES OBSERVED IN THE MERGE FIELD ]
my %sc;
for my $cf (@csv_files){
    open(my $fh_in, '<', $cf) or die $!;

    while (my $line = <$fh_in>){
        die "Failed parse : $cf : $.\n" unless $parser->parse($line);
        my @fields = $parser->fields;
        push @{ $sc{$fields[$EMAIL_i]} }, $fields[$MERGE_i];
    }
}

# Process the files again, writing new output.
for my $cf (@csv_files){
    open(my $fh_in,  '<', $cf)             or die $!;
    open(my $fh_out, '>', "${cf}_new.csv") or die $!;

    while (my $line = <$fh_in>){
        die "Failed parse : $cf : $.\n" unless $parser->parse($line);
        my @fields = $parser->fields;

        $fields[$MERGE_i] = join ', ', @{ $sc{$fields[$EMAIL_i]} };

        $parser->print($fh_out, \@fields);
        print $fh_out "\n";
    }
}
FM
This worked quite well! I had to throw in `binmode $fh_in, ":utf8";` and manually clean up some blank lines from each file (`:g/^$/d`), but this worked. Thanks.
alex
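
The blank-line cleanup mentioned above can also be done from the command line instead of an editor. A minimal sketch, assuming the unwanted lines are completely empty (no spaces or carriage returns) and using a hypothetical helper name `strip_blank_lines`:

```shell
# strip_blank_lines FILE: hypothetical helper that rewrites FILE
# without its empty lines, going through a temporary file.
strip_blank_lines() {
  sed '/^$/d' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Example: clean every output file written by the script above
# (it names them "<input>_new.csv").
for f in *_new.csv; do
  [ -e "$f" ] || continue   # skip if the glob matched nothing
  strip_blank_lines "$f"
done
```

Writing to a temporary file first avoids truncating the input while sed is still reading it.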
A: 

Here's a simple Perl program achieving what you need. It does a single pass on your input by relying on the fact that it is sorted beforehand.

It reads lines and appends the code as long as the email does not change. When the email changes, it prints the record (and fixes extra double quotes in the code field).

#!/usr/bin/perl -l

use strict;
use warnings;

my $last_email = undef;
my @current_record = ();
my @fields = ();

sub print_record {
  # Remove repeated double quotes introduced when we appended the code
  $current_record[15] =~ s/""/, /g;
  print join ",", @current_record;
  @current_record = ();
} 

while (my $input_line = <>) {
  chomp $input_line;
  @fields = split ",", $input_line;

  # Print a record when the email we read changes. Avoid printing on the first
  # loop by checking we have read at least one email ($last_email is defined).
  defined $last_email && ($fields[0] ne $last_email) && print_record;

  if (!@current_record)  {
    # We are starting to process a new email. Grab all fields.
    @current_record = @fields;
  }
  else {
    # We have consecutive records with the same email. Append the code.
    $current_record[15] .= $fields[15];
  }

  # Remember the last processed email. When it changes we will print @current_record.
  $last_email = $fields[0];
}

# Print the last record
print_record

The -l switch makes print append a newline automatically (whatever the OS is).

Call it like this:

sort *.csv | ./script.pl
Philippe A.