views: 451
answers: 7
At regular intervals we are receiving CSV files from an external source that we have little control over. These files are complete sets of current records; however, any records that have been deleted since the previous file are not present. We would like to compare the two files and create a separate file of deleted records so we can do some additional processing on them. In an application in another area we have a commercial sort package (CoSort) that does this out of the box; however, we don't have access to that here. The volumes aren't that large, though, and it seems like this is something that standard or free tools might be able to handle quite easily. Ideally this would take the form of a Windows batch file, but Perl or awk solutions would be okay too. Example input files:

Previous File:

X_KEY,X_NAME,X_ATTRIBUTE
123,Name 123,ATT X
111,Name 111,ATT X
777,Name 777,ATT Y

Incoming File:

X_KEY,X_NAME,X_ATTRIBUTE
777,Name 777,ATT Y
123,Name 123,ATT CHANGED

Resulting File should be at a minimum:

111,Name 111

But if the attributes from the deleted records come through too, that is fine.

So far I have a batch file that uses freeware CMSort to sort the two files minus the header record to make it easier for some type of diff process:

REM Sort Previous File, Skip Header

C:\Software\CMSort\cmsort.exe /H=1 x_previous.txt x_previous_sorted.txt

REM Sort Incoming File, Skip Header

C:\Software\CMSort\cmsort.exe /H=1 x_incoming.txt x_incoming_sorted.txt

But the 'compare and show only the missing records from the first file' bit is eluding me. Part of the complexity is that numerous attributes can change among the records that remain, so it isn't a pure diff. However, it feels like a job for a specialized diff command, one limited to checking just the key field rather than the entire record. I can't seem to get the syntax correct, though. Ideas? Record counts shouldn't exceed 50k.

Note: If this were SQL and the data were sitting in tables, we could use the EXCEPT operator but moving the data to the database in this case is not an option.
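For reference, the key-limited comparison described above is essentially what GNU `join -v` does on key-sorted input (GnuWin32 ships `join` in its CoreUtils package). A minimal sketch, recreating the question's sample data as it would look after the sort step with headers stripped:

```shell
# Question's sample data, key-sorted, headers removed.
printf '111,Name 111,ATT X\n123,Name 123,ATT X\n777,Name 777,ATT Y\n' > x_previous_sorted.txt
printf '123,Name 123,ATT CHANGED\n777,Name 777,ATT Y\n' > x_incoming_sorted.txt

# -t, sets the field delimiter; -v 1 prints the lines of file 1 whose
# join field (field 1, the key, by default) has no match in file 2 --
# i.e. the deleted records, attributes included.
join -t, -v 1 x_previous_sorted.txt x_incoming_sorted.txt
# -> 111,Name 111,ATT X
```

Both inputs must be sorted on the key, which the CMSort step already handles.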

A: 

It would not be hard to write a small console application that went through the first file, parsed out the keys, checked the second file for rows matching those keys, and then created the third file. I guess I'm saying this seems like a good case to roll your own. :) By the way, this is an O(mn) operation, where m and n are the sizes of file 1 and file 2, so it's probably not going to be uber speedy.

JP Alioto
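For what it's worth, loading the incoming keys into a hash first makes the scan O(m+n) rather than O(mn), and since the question allows awk, that idea fits in one awk program using its associative arrays. A sketch against the question's sample files (file names assumed, headers left in place):

```shell
# Question's sample data, headers included.
printf 'X_KEY,X_NAME,X_ATTRIBUTE\n123,Name 123,ATT X\n111,Name 111,ATT X\n777,Name 777,ATT Y\n' > x_previous.txt
printf 'X_KEY,X_NAME,X_ATTRIBUTE\n777,Name 777,ATT Y\n123,Name 123,ATT CHANGED\n' > x_incoming.txt

# First pass (NR==FNR, i.e. the first file) records each incoming key in
# an associative array; second pass prints any previous-file record whose
# key is absent. FNR > 1 skips each file's header line. No sorting needed.
awk -F, 'NR==FNR { if (FNR > 1) seen[$1] = 1; next }
         FNR > 1 && !($1 in seen)' x_incoming.txt x_previous.txt
# -> 111,Name 111,ATT X
```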
+2  A: 

If I were to do this in Perl, I'd just use a couple of hashes:


#!/usr/bin/perl
use strict;
use warnings;

my (%orig, %new, %changed);

open(my $f1, '<', $ARGV[0]) or die "Couldn't open $ARGV[0]: $!\n";
while (<$f1>) {
    chomp;
    my @fields = split /,/;
    $orig{$fields[0]} = $_;
}
close($f1);

open(my $f2, '<', $ARGV[1]) or die "Couldn't open $ARGV[1]: $!\n";
while (<$f2>) {
    chomp;
    my @fields = split /,/;
    if ($orig{$fields[0]}) {
        if ($orig{$fields[0]} ne $_) {
            $changed{$fields[0]} = $orig{$fields[0]} . "||" . $_;
        }
        delete $orig{$fields[0]};
    } else {
        $new{$fields[0]} = $_;
    }
}
close($f2);

print "Deleted:\n";
print map { $orig{$_} . "\n" } sort { $b <=> $a } keys %orig;
print "Added:\n";
print map { $new{$_} . "\n" } sort { $b <=> $a } keys %new;
print "Changed:\n";
print map { $changed{$_} . "\n" } sort { $b <=> $a } keys %changed;

Assuming your text examples are in files f1.txt and f2.txt:


kettle$ ./compare.pl f1.txt f2.txt

Deleted:
111,Name 111,ATT X
Added:
Changed:
123,Name 123,ATT X||123,Name 123,ATT CHANGED

blackkettle
+1  A: 
#!/usr/bin/perl

use strict;
use warnings;

@ARGV == 2 or die "mycompare oldfile newfile\n";

my ($oldfile, $newfile) = @ARGV;

my %newrecords;

open my $new, '<', $newfile
    or die "Cannot open '$newfile':$!";

scalar <$new>; # skip header

while ( my $line = <$new> ) {
    next unless $line =~ /\S/;
    my ($record) = split /,/, $line;
    $newrecords{ $record } = 1;
}

close $new;

open my $old, '<', $oldfile
    or die "Cannot open '$oldfile': $!";

scalar <$old>; # skip header

while ( my $line = <$old> ) {
    next unless $line =~ /\S/;
    my ($record) = split /,/, $line;
    print $line unless exists $newrecords{ $record };
}

close $old;

__END__

C:\Temp> mycompare old.csv new.csv
111,Name 111,ATT X
Sinan Ünür
A: 

I would use diff, and if it's not available directly, use it through cygwin.

Svante
+4  A: 
diff a.input b.input | grep ^[\>]

http://gnuwin32.sourceforge.net/packages/diffutils.htm

http://gnuwin32.sourceforge.net/packages/grep.htm

nullptr
The Perl scripts left by blackkettle and Sinan both do the trick for me (many thanks!), but this is more along the lines of what I was originally looking for. However, with gnuwin32 grep, I get an error attempting ^[\>]. I also don't see how you are going to weed out the changed records from the deleted ones (since both will show up with <).
Stephen Pace
You should put the argument to grep into quotes, as both ^ and > have a special meaning for cmd.
Joey
Johannes, thanks, that eliminates the error for me; however, I don't think it addresses the second point. That particular grep is returning both the deleted records and the changed records. I would like to see only the deleted records.
Stephen Pace
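A possible way to keep the grep approach while restricting the comparison to the key field, and so drop the changed records, is to turn the incoming keys into anchored patterns and filter with `grep -v -f`. A sketch against the question's sample data (file names assumed):

```shell
# Question's sample data, headers included.
printf 'X_KEY,X_NAME,X_ATTRIBUTE\n123,Name 123,ATT X\n111,Name 111,ATT X\n777,Name 777,ATT Y\n' > x_previous.txt
printf 'X_KEY,X_NAME,X_ATTRIBUTE\n777,Name 777,ATT Y\n123,Name 123,ATT CHANGED\n' > x_incoming.txt

# Turn each incoming key into an anchored pattern such as "^777," so only
# the key field is compared, regardless of changed attributes.
tail -n +2 x_incoming.txt | cut -d, -f1 | sed 's/.*/^&,/' > incoming_keys.txt

# Keep only previous-file records (header skipped) matching none of the
# incoming keys -- the deleted records.
tail -n +2 x_previous.txt | grep -v -f incoming_keys.txt
# -> 111,Name 111,ATT X
```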
A: 

I worked on a faintly similar system once upon a long time ago (distressingly close to 20 years ago, in fact). The data was stored in a database, and the incoming data had to be compared with the previous week's data. There were 74 columns in the data, and some of them took the form of 'prev code, prev date' pairs (for several cycles). So, one legitimate set of changes was for:

Old:
name1:  value1        date1:  2000-01-01
name2:  value2        date2:  1995-05-31
name3:  value3        date3:  1990-10-13

New:
name1:  New Name      date1:  2009-03-31
name2:  Other Name    date2:  2005-04-12
name3:  value1        date3:  2000-01-01

In this scenario, what was 'name1, date1' has become 'name3, date3'; there appears to have been an intermediate name change at a date some time in the past, and a recent name change. The intelligence about name changes could be delayed. Working out what was common with what went before and what was brand new was tough. Indeed, the whole process was tricky.

I actually developed a bilingual self-regenerating program to manage this stuff. The top part of the source code took advantage of the comment notations in the primary language (Informix 4GL), which used both shell-style #...eol comments and {...} comments (the latter able to spread over multiple lines). Of course, {...} is also a shell construct, for grouping I/O redirection. So, I had the shell script that would generate the I4GL code embedded in a {...} comment, and the code was generated from a table defining the 70+ columns and how each one needed to be handled. This saved a lot of typing over time.

Jonathan Leffler
A: 

There's also Algorithm::Diff available from CPAN.

singingfish