views: 451
answers: 7
At regular intervals we are receiving CSV files from an external source that we have little control over. These files are complete sets of current records; however, any records that have been deleted since the previous file are not present. We would like to compare the two files and create a separate file of deleted records so we can do some additional processing on them. In an application in another area we have a commercial sort package (CoSort) that does this out of the box; however, we don't have access to that here. The volumes aren't that large, though, and it seems like this is something that standard or free tools might be able to handle quite easily. Ideally this would take the form of a Windows batch file, but Perl or awk solutions would be okay too. Example input files:

Previous File:

X_KEY,X_NAME,X_ATTRIBUTE
123,Name 123,ATT X
111,Name 111,ATT X
777,Name 777,ATT Y

Incoming File:

X_KEY,X_NAME,X_ATTRIBUTE
777,Name 777,ATT Y
123,Name 123,ATT CHANGED

Resulting File should be at a minimum:

111,Name 111

But if the attributes from the deleted records come through too, that is fine.

So far I have a batch file that uses freeware CMSort to sort the two files minus the header record to make it easier for some type of diff process:

REM Sort Previous File, Skip Header

C:\Software\CMSort\cmsort.exe /H=1 x_previous.txt x_previous_sorted.txt

REM Sort Incoming File, Skip Header

C:\Software\CMSort\cmsort.exe /H=1 x_incoming.txt x_incoming_sorted.txt

But the 'compare and show only the missing records from the first file' bit is eluding me. Part of the complexity is that numerous attributes can change among the records that remain, so it isn't a pure diff. However, it feels like a job for a specialized diff command, one limited to checking just the key field rather than the entire record. I can't seem to get the syntax correct, though. Ideas? Record counts shouldn't exceed 50k.

Note: If this were SQL and the data were sitting in tables, we could use the EXCEPT operator but moving the data to the database in this case is not an option.
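For reference, the key-limited comparison described above is essentially what GNU `join -v` does on key-sorted input (GnuWin32 ships `join` in its CoreUtils package). A minimal sketch, recreating the question's sample data as it would look after the sort step with headers stripped:

```shell
# Question's sample data, key-sorted, headers removed.
printf '111,Name 111,ATT X\n123,Name 123,ATT X\n777,Name 777,ATT Y\n' > x_previous_sorted.txt
printf '123,Name 123,ATT CHANGED\n777,Name 777,ATT Y\n' > x_incoming_sorted.txt

# -t, sets the field delimiter; -v 1 prints the lines of file 1 whose
# join field (field 1, the key, by default) has no match in file 2 --
# i.e. the deleted records, attributes included.
join -t, -v 1 x_previous_sorted.txt x_incoming_sorted.txt
# -> 111,Name 111,ATT X
```

Both inputs must be sorted on the key, which the CMSort step already handles.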

A: 

It would not be hard to write a small console application that went through the first file, parsed out the keys, checked the second file for rows matching those keys, and then created the third file. I guess I'm saying this seems like a good case to roll your own. :) By the way, this is an O(mn) operation, where m and n are the sizes of file 1 and file 2, so it's probably not going to be uber speedy.

JP Alioto
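For what it's worth, loading the incoming keys into a hash first makes the scan O(m+n) rather than O(mn), and since the question allows awk, that idea fits in one awk program using its associative arrays. A sketch against the question's sample files (file names assumed, headers left in place):

```shell
# Question's sample data, headers included.
printf 'X_KEY,X_NAME,X_ATTRIBUTE\n123,Name 123,ATT X\n111,Name 111,ATT X\n777,Name 777,ATT Y\n' > x_previous.txt
printf 'X_KEY,X_NAME,X_ATTRIBUTE\n777,Name 777,ATT Y\n123,Name 123,ATT CHANGED\n' > x_incoming.txt

# First pass (NR==FNR, i.e. the first file) records each incoming key in
# an associative array; second pass prints any previous-file record whose
# key is absent. FNR > 1 skips each file's header line. No sorting needed.
awk -F, 'NR==FNR { if (FNR > 1) seen[$1] = 1; next }
         FNR > 1 && !($1 in seen)' x_incoming.txt x_previous.txt
# -> 111,Name 111,ATT X
```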
+2  A: 

If I were to do this in Perl, I'd just use a couple of hashes:


#!/usr/bin/perl
use strict;
use warnings;

my (%orig, %new, %changed);

open(my $f1, '<', $ARGV[0]) or die "Couldn't open $ARGV[0]: $!\n";
while (<$f1>) {
    chomp;
    my @fields = split /,/;
    $orig{$fields[0]} = $_;
}
close($f1);

open(my $f2, '<', $ARGV[1]) or die "Couldn't open $ARGV[1]: $!\n";
while (<$f2>) {
    chomp;
    my @fields = split /,/;
    if ($orig{$fields[0]}) {
        if ($orig{$fields[0]} ne $_) {
            $changed{$fields[0]} = $orig{$fields[0]} . "||" . $_;
        }
        delete $orig{$fields[0]};
    } else {
        $new{$fields[0]} = $_;
    }
}
close($f2);

print "Deleted:\n";
print map { $orig{$_} . "\n" } sort { $b <=> $a } keys %orig;
print "Added:\n";
print map { $new{$_} . "\n" } sort { $b <=> $a } keys %new;
print "Changed:\n";
print map { $changed{$_} . "\n" } sort { $b <=> $a } keys %changed;

Assuming your text examples are in files f1.txt and f2.txt:


kettle$ ./compare.pl f1.txt f2.txt

Deleted:
111,Name 111,ATT X
Added:
Changed:
123,Name 123,ATT X||123,Name 123,ATT CHANGED

blackkettle
+1  A: 
#!/usr/bin/perl

use strict;
use warnings;

@ARGV == 2 or die "mycompare oldfile newfile\n";

my ($oldfile, $newfile) = @ARGV;

my %newrecords;

open my $new, '<', $newfile
    or die "Cannot open '$newfile':$!";

scalar <$new>; # skip header

while ( my $line = <$new> ) {
    next unless $line =~ /\S/;
    my ($record) = split /,/, $line;
    $newrecords{ $record } = 1;
}

close $new;

open my $old, '<', $oldfile
    or die "Cannot open '$oldfile': $!";

scalar <$old>; # skip header

while ( my $line = <$old> ) {
    next unless $line =~ /\S/;
    my ($record) = split /,/, $line;
    print $line unless exists $newrecords{ $record };
}

close $old;

__END__

C:\Temp> mycompare old.csv new.csv
111,Name 111,ATT X
Sinan Ünür
A: 

I would use diff, and if it's not available directly, use it through cygwin.

Svante
+4  A: 
diff a.input b.input | grep ^[\>]

http://gnuwin32.sourceforge.net/packages/diffutils.htm

http://gnuwin32.sourceforge.net/packages/grep.htm

nullptr
The Perl scripts left by blackkettle and Sinan both do the trick for me (many thanks!), but this is more along the lines of what I was originally looking for. However, with gnuwin32 grep, I get an error attempting ^[\>]. I also don't see how you are going to weed out the changed records from the deleted ones (since both will show up with <).
Stephen Pace
You should put the argument to grep into quotes, as both ^ and > have a special meaning for cmd.
Joey
Johannes, thanks, that eliminates the error for me; however, I don't think it addresses the second point. That particular grep is returning both the deleted records and the changed records. I would like to see only the deleted records.
Stephen Pace
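A possible way to keep the grep approach while restricting the comparison to the key field, and so drop the changed records, is to turn the incoming keys into anchored patterns and filter with `grep -v -f`. A sketch against the question's sample data (file names assumed):

```shell
# Question's sample data, headers included.
printf 'X_KEY,X_NAME,X_ATTRIBUTE\n123,Name 123,ATT X\n111,Name 111,ATT X\n777,Name 777,ATT Y\n' > x_previous.txt
printf 'X_KEY,X_NAME,X_ATTRIBUTE\n777,Name 777,ATT Y\n123,Name 123,ATT CHANGED\n' > x_incoming.txt

# Turn each incoming key into an anchored pattern such as "^777," so only
# the key field is compared, regardless of changed attributes.
tail -n +2 x_incoming.txt | cut -d, -f1 | sed 's/.*/^&,/' > incoming_keys.txt

# Keep only previous-file records (header skipped) matching none of the
# incoming keys -- the deleted records.
tail -n +2 x_previous.txt | grep -v -f incoming_keys.txt
# -> 111,Name 111,ATT X
```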
A: 

I worked on a faintly similar system once upon a long time ago (distressingly close to 20 years ago, in fact). The data was stored in a database, and the incoming data had to be compared with the previous week's data. There were 74 columns in the data, and some of them took the form of 'prev code, prev date' pairs (for several cycles). So, one legitimate set of changes was for:

Old:
name1:  value1        date1:  2000-01-01
name2:  value2        date2:  1995-05-31
name3:  value3        date3:  1990-10-13

New:
name1:  New Name      date1:  2009-03-31
name2:  Other Name    date2:  2005-04-12
name3:  value1        date3:  2000-01-01

In this scenario, what was 'name1, date1' has become 'name3, date3'; there appears to have been an intermediate name change at a date some time in the past, and a recent name change. The intelligence about name changes could be delayed. Working out what was common with what went before and what was brand new was tough. Indeed, the whole process was tricky.

I actually developed a bilingual self-regenerating program to manage this stuff. The top part of the source code took advantage of the comment notations in the primary language (Informix 4GL), which used both shell-style #...eol comments and {...} comments (the latter able to spread over multiple lines). Of course, {...} is also a shell construct, for grouping I/O redirection. So, I had the shell script that would generate the I4GL code embedded in a {...} comment, and the code was generated from a table defining the 70+ columns and how each one needed to be handled. This saved a lot of typing over time.

Jonathan Leffler
A: 

There's also Algorithm::Diff available from CPAN.

singingfish