views:

806

answers:

7

Is there an algorithm or diff-like utilities to find difference between two csv files? Example:

file1
-------
key1,value1
key2,value2
key3,value3
key5,value5
key7,value7

file2
-------
key1,value1
key3,value3
key4,value4
key5,value5
key6,value6

With this diff-like utilities it will output 3 types of records:

  1. Records that only exists in file1 (file1 minus file2 set operation)
  2. Records that only exists in file2 (file2 minus file1 set operation)
  3. Records that exists both in file1 and file2 (intersect set operation)
+1  A: 

there is Algorith::Diff

dsm
A: 

You can use hashes in Perl. Read each file into a separate hash, something like

my %File1 = ();
my %File2 = ();
# Filehandles FP1 and FP2 is opened for read
while (<FP1>) {
    if (/^([^,]+),(.+)$/) {
        my ($key, $value) = ($1, $2);
        $File1{$key} = $value;
    }
}
# Repeat for FP2

To print out the results, you can loop through the hashes and check if the key/value is identical, different or missing in various ways. Example:

for my $key (keys %File1) {
    if (defined($File1{$key}) && defined($File2{$key}) {
        print("$key exists in both files\n");
    } elsif (defined($File1{$key})) {
        print("$key exists only in file1\n");
    }
}
# Repeat for %File2
sunny256
+1  A: 

You can use the unix 'join' command to do this. It is available in Cygwin for Windows as well.

Example:

$ join -t ',' -v 1 file1 file2
key2,value2
key7,value7
$ join -t ',' -v 2 file1 file2
key4,value4
key6,value6
$ join -t ',' file1 file2
key1,value1,value1
key3,value3,value3
key5,value5,value5
JimG
There is also `comm`.
Svante
Both will break when you have a quoted comma in your CSV though.
yangyang
A: 

You could take a look at my FOSS CSV stream editor CSVfix, which does what you want via the join command - no programming needed.

anon
+4  A: 

diff can do what you want..

diff file1.csv file2.csv --old-line-format="< %L" --new-line-format="> %L" --unchanged-line-format="= %L"
Anon
I think this is really something I could use, in one command I got all the differences.
Hendra Saputra
I would add that you should sort both of the CSV files first using the sort command. I would also add that is not 100% reliable if your CSV has any multi-line rows in which case you would need to parse the file properly with Text::xSV, record the keys into hashes and compare them with List::Util and/or List::MoreUtil.
jiggy
A: 

How about an example using SQLite?

DROP TABLE 'file1';
DROP TABLE 'file2';

CREATE TABLE 'file1' (
    key_field VARCHAR primary key,
    value_field VARCHAR
);

CREATE TABLE 'file2' (
    key_field VARCHAR primary key,
    value field VARCHAR
);


.bail off
.separator ,
.import file1.csv file1
.import file2.csv file2

.output stdout
.header on

SELECT col1 AS 'In file1.csv, not in file2.csv' FROM (
    SELECT file1.key_field AS col1,
           file2.key_field AS col2 
    FROM file1 LEFT OUTER JOIN file2
    ON file1.key_field == file2.key_field
) 
WHERE col2 IS NULL
;

SELECT col2 AS 'In file2.csv, not in file1.csv'FROM (
    SELECT file1.key_field AS col1,
           file2.key_field AS col2
    FROM file2 LEFT OUTER JOIN file1
    ON file2.key_field == file1.key_field
) WHERE col1 IS NULL
;

SELECT file1.key_field AS 'In both file1.csv and file2.csv'
    FROM file1 INNER JOIN file2
    WHERE file1.key_field == file2.key_field
;

Here is the output:

C:\Temp> sqlite3 test.db < t.sql
In file1.csv, not in file2.csv
key2
key7
In file2.csv, not in file1.csv
key4
key6
In both file1.csv and file2.csv
key1
key3
key5
Sinan Ünür
A: 

take a look at http://sourceforge.net/projects/csvdiff/

Quote: csvdiff is a Perl script to diff/compare two csv files with the possibility to select the separator. Differences will be shown like: "Column XYZ in record 999" is different. After this, the actual and the expected result for this column will be shown.

max muster