views: 155
answers: 6

I have two text files:

  • file 1

    number,name,account id,vv,sfee,dac acc,TDID
    7000,john,2,0,0,1,6
    7001,elen,2,0,0,1,7
    7002,sami,2,0,0,1,6
    7003,mike,1,0,0,2,1
    8001,nike,1,2,4,1,8
    8002,paul,2,0,0,2,7 
    
  • file 2

    number,account id,dac acc,TDID
    7000,2,1,6
    7001,2,1,7
    7002,2,1,6
    7003,1,2,1
    

I want to compare these two text files: if the four columns of file 2 are present in file 1 with equal values, I want output like this:

7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1

nawk -F"," 'NR==FNR {a[$1];next} ($1 in a)' file2.txt file1.txt

This works well for comparing a single column between two files, but I want to compare multiple columns. Does anyone have a suggestion?


EDIT: From the OP's comments:

nawk -F"," 'NR==FNR {a[$1];next} ($1 in a)' file2.txt file1.txt

This works well for comparing a single column between two files, but I want to compare multiple columns. Do you have any suggestions?

A: 

Quick answer: Use cut to split out the fields you need and diff to compare the results.

Carl Smotricz
Thanks for your comment, but this is not what I am looking for. I know how to compare a single column between two text files; I want to compare multiple columns.
gyrous
Nothing stops you from extracting multiple columns with `cut` for your comparison. Or am I missing something?
Carl Smotricz
`nawk -F"," 'NR==FNR {a[$1];next} ($1 in a)' file2.txt file1.txt` works well for comparing a single column between two files, but I want to compare multiple columns. Do you have any suggestions?
gyrous
Having re-read your question, my approach won't do exactly what you want, sorry. The awk statement you give could be fiddled to work with multiple fields, but I'm not knowledgeable enough with awk to write it up for you. I suggest you include that line of code in an edit of your question; it will give people a lot more to work with for a good answer.
Carl Smotricz
+1  A: 

TxtSushi looks like what you want. It lets you work with CSV files using SQL.

Roman Cheplyaka
Can you help me do this in Unix? My file is huge, more than 2 GB.
gyrous
Indeed, I'm not sure that TxtSushi was created with that scale in mind. Then maybe importing into a real DBMS would make sense?
Roman Cheplyaka
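The DBMS route suggested above can be sketched with the sqlite3 command-line shell (an illustration, not from the thread; it assumes sqlite3 is installed and recent enough for CSV `.import` to auto-create a table from the header row; the function name is made up):

```shell
# Import both CSVs into a transient in-memory database and join
# on the four key columns. Column names come from the header rows.
csv_join() {
    sqlite3 <<EOF
.mode csv
.import $1 f1
.import $2 f2
SELECT f1.* FROM f1
JOIN f2 ON  f1.number       = f2.number
        AND f1."account id" = f2."account id"
        AND f1."dac acc"    = f2."dac acc"
        AND f1.TDID         = f2.TDID;
EOF
}
```

Usage would be `csv_join file1 file2`. For a 2 GB input, an on-disk database (`sqlite3 compare.db` instead of the in-memory default) would avoid holding the whole import in RAM.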
+1  A: 

It's not an elegant one-liner, but you can do it with Perl.

#!/usr/bin/perl
use strict;
use warnings;

# Usage: cmp_csv.pl file1 file2
# Index file1 by its first column (number).
open my $fh1, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
my %k;
while (my $line = <$fh1>) {
    next if $. == 1;                # skip the header row
    chomp $line;
    my @f = split /,/, $line;
    $k{$f[0]} = \@f;
}
close $fh1;

# Print the stored file1 record when account id, dac acc and TDID
# also match the corresponding file2 record.
open my $fh2, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
while (my $line = <$fh2>) {
    next if $. == 1;                # skip the header row
    chomp $line;
    my @f = split /,/, $line;
    print join(',', @{ $k{$f[0]} }), "\n"
        if defined $k{$f[0]}
        && $k{$f[0]}[2] == $f[1]    # account id
        && $k{$f[0]}[5] == $f[2]    # dac acc
        && $k{$f[0]}[6] == $f[3];   # TDID
}
close $fh2;
pdehaan
Sorry, I don't understand. Can you explain where I give my two input files? Kindly mention them as file1 and file2; then it is easy for me to understand.
gyrous
Save the above script as something like `cmp_csv.pl`. Make the script executable: `chmod +x cmp_csv.pl`. Finally, run it: `./cmp_csv.pl file1 file2 > outfile`.
B Johnson
./cmp_csv.pl: line 4: open: command not found
./cmp_csv.pl: line 5: syntax error near unexpected token `)'
./cmp_csv.pl: line 5: `while(split/,/,<A>) {'

When I run the script it shows the error above.
gyrous
What version of perl are you using? `perl -v`
B Johnson
Sorry, here is what you asked for: v5.8.4 built for sun4-solaris-64int.
gyrous
@gyrous Sorry, but I'm not an expert at perl, which is why I'm not sure what your problem is (the above works fine on my linux box). Maybe if you tag your question with 'perl' or 'awk', someone with more experience will help you.
B Johnson
The clue to what's wrong is the "line 4" in the error above. gyrous, copy the perl script **exactly** as written: the first character of the file needs to be the "#" character.
Borealid
@pdehaan and @Borealid, it works great, thanks a lot. But I am a little confused: the script works for the example files I gave, but my real files have many more columns. I don't understand how you mapped the fields between the two files in your script. Kindly clarify whether this script works only for my example files, or how I can change which fields are compared.
gyrous
The `==` lines are doing the column comparisons: the left side of each indexes a column of the stored first-file record, the right side a column of the second-file record. The first column of both files is used as the lookup index (its value is assumed to be unique).
pdehaan
A: 

This is neither efficient nor pretty, but it will get the job done. It is not the most efficient implementation, as it pushes file1 through several processes, but it does not read the entire file into RAM either, so it has some advantages over the simple scripting approaches.

sed -n '2,$p' file1 | awk -F, '{print $1 "," $3 "," $6 "," $7 " " $0 }' | \
sort | join file2 - |awk '{print $2}'

This works as follows:

  1. sed -n '2,$p' file1 sends file1 to STDOUT without the header line
  2. The first awk command prints the 4 "key fields" from file1 in the same format as they are in file2 followed by a space followed by the contents of file1
  3. The sort command ensures that file1 is in the same order as file2
  4. The join command joins file2 and STDOUT only writing records that have a matching record in file2
  5. The final awk command prints just the original part of file1

In order for this to work you must ensure that file2 is sorted before running the command.

Running this against your example data gave the following result

7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
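The whole recipe (header stripping, key extraction, sorting, joining) can be wrapped in one shell function. A sketch, assuming the example column layout and a header row in both files (the function name is illustrative):

```shell
# $1 = file1 (7 columns), $2 = file2 (4 key columns).
# Prints the file1 records whose key columns appear in file2.
compare_files() {
    tmp=$(mktemp)
    sed -n '2,$p' "$2" | sort > "$tmp"          # file2: drop header, sort
    sed -n '2,$p' "$1" \
      | awk -F, '{print $1 "," $3 "," $6 "," $7 " " $0}' \
      | sort \
      | join "$tmp" - \
      | awk '{print $2}'
    rm -f "$tmp"
}
```

Called as `compare_files file1 file2`, it writes the matching records to standard output.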

EDIT

I note from your comments that you are getting a sorting error. If this error occurs when sorting file2 before running the pipeline, you could split the file, sort each part, and then cat the parts back together.

Something like this would do that for you

mv file2 file2.orig
for i in 0 1 2 3 4 5 6 7 8 9
do
  grep "^${i}" file2.orig |sort > file2.$i
done
cat file2.[0-9] >file2
rm file2.[0-9] file2.orig

You may need to modify the values passed to the for loop if your file is not distributed evenly across the full range of leading digits.

Steve Weet
Thanks Steve Weet, it's working fine, but my problem is that my file is more than 2 GB; it is not possible to sort such a huge file. Do you have any suggestions?
gyrous
The obvious answer is to ask whoever is providing the files to you if they can sort them when they generate them. Other than that then no real ideas I'm afraid.
Steve Weet
Are you running out of RAM or disk space? When are you getting the error, and what is the error? See EDIT for suggestions on sorting file2.
Steve Weet
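On the 2 GB sorting problem raised in these comments: GNU sort already performs an external merge sort, spilling sorted runs to temporary files, so the input size is limited by disk space rather than RAM. A sketch (the `-S`/`-T` flags are GNU coreutils extensions; the function name and paths are illustrative):

```shell
# External merge sort of a CSV on its first field, without loading
# the whole file into RAM.
sort_by_key() {
    # -S caps the in-memory buffer; -T picks a temp directory
    # on a filesystem with enough free space for the spill files.
    sort -t, -k1,1 -S 512M -T "${TMPDIR:-/tmp}" "$1"
}
```

Usage would be `sort_by_key file2 > file2.sorted`.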
A: 

Not really well tested, but this might work:

join -t, file1 file2 | awk -F, 'BEGIN{OFS=","} {if ($3==$8 && $6==$9 && $7==$10) print $1,$2,$3,$4,$5,$6,$7}'

(Of course, this assumes the input files are sorted).

ninjalj
Thanks ninjalj, it's working fine, but my problem is that my file is more than 2 GB; it is not possible to sort such a huge file. Do you have any suggestions?
gyrous
+1  A: 

This awk one-liner handles multiple columns and does not require sorted input:

awk -F, 'NR==FNR{a[$1,$2,$3,$4]++;next} (a[$1,$3,$6,$7])' file1.txt file2.txt

In order for this to work, it is imperative that the first input file (file1.txt in my example) be the one with only 4 fields, like so:

file1.txt

7000,2,1,6
7001,2,1,7
7002,2,1,6
7003,1,2,1

file2.txt

7000,john,2,0,0,1,6
7000,john,2,0,0,1,7
7000,john,2,0,0,1,8
7000,john,2,0,0,1,9
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
7003,mike,1,0,0,2,2
7003,mike,1,0,0,2,3
7003,mike,1,0,0,2,4
8001,nike,1,2,4,1,8
8002,paul,2,0,0,2,7

Output

$ awk -F, 'NR==FNR{a[$1,$2,$3,$4]++;next} (a[$1,$3,$6,$7])' file1.txt file2.txt
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1

Alternatively, you could use the following syntax, which more closely matches the one in your question but is not very readable, IMHO:

awk -F, 'NR==FNR{a[$1,$2,$3,$4];next} ($1SUBSEP$3SUBSEP$6SUBSEP$7 in a)' file1.txt file2.txt
SiegeX