How can I do it?

I have a file that looks like this

foo 1 scaf 3 
bar 2 scaf 3.3

File2 looks like this

foo 1 scaf 4.5
foo 1 boo 2.3
bar 2 scaf 1.00

What I want to do is to find lines that co-occur in file1 and file2 when fields 1, 2, and 3 are the same.

Is there a way to do it?

+1  A: 

How about:

cat file1 file2 |
    awk '{print $1" "$2" "$3}' |
    sort |
    uniq -c |
    grep -v '^ *1 ' |
    awk '{print $2" "$3" "$4}'

This is assuming you're not too worried about the white space between fields (in other words, three tabs and a space is no different to a space and 7 tabs). This is usually the case when you're talking about fields within a text file.

What it does is output both files, stripping off the last field (since you don't care about that one in terms of comparisons). It then sorts that output so that identical lines are adjacent, then uniquifies them (replaces each group of adjacent identical lines with one copy and a count).

It then gets rid of all those that had a one-count (no duplicates) and prints out each with the count stripped off. That gives you your "keys" to the duplicate lines and you can then use another awk iteration to locate those keys in the files if you wish.

This won't work as expected if two identical keys are only in one file since the files are combined early on. In other words, if you have duplicate keys in file1 but not in file2, that will be a false positive.

In that case, the only real solution I can think of is one that checks file2 for each line in file1, although I'm sure others may come up with cleverer solutions.


And, for those who enjoy a little bit of sado-masochism, here's the afore-mentioned not-overly-efficient solution:

cat file1 | sed \
    -e 's/ [^ ]*$/ "/' \
    -e 's/ /  */g' \
    -e 's/^/grep "^/' \
    -e 's/$/ file2 | awk "{print \\$1\\" \\"\\$2\\" \\"\\$3}"/' \
    >xx99
bash xx99
rm xx99

This one constructs a separate script file to do the work. For each line in file1, it creates a line in the script to look for that in file2. If you want to see how it works, just have a look at xx99 before you delete it.

And, in this one, the spaces do matter so don't be surprised if it doesn't work for lines where spaces are different between file1 and file2 (though, as with most "hideous" scripts, that can be fixed with just another link in the pipeline). It's more here as an example of the ghastly things you can create for quick'n'dirty jobs.

This is not what I would do for production-quality code but it's fine for a once-off, provided you destroy all evidence of it before The Daily WTF finds out about it :-)
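A plainer sketch of that per-line lookup (not part of the answer above; it assumes whitespace-separated fields containing no regex metacharacters):

```shell
# For each line of file1, take fields 1-3 as the key and grep file2 for
# lines starting with that key. Assumes the fields contain no characters
# that are special to grep's basic regular expressions.
while read -r f1 f2 f3 _; do
    grep "^$f1  *$f2  *$f3 " file2
done < file1
```

On the sample files this prints the two file2 lines whose first three fields also appear in file1. It reads file2 once per file1 line, so it shares the inefficiency of the sed version, just without the generated-script contortions.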

paxdiablo
+1  A: 

It's probably easiest to combine the first three fields with awk:

awk '{print $1 "_" $2 "_" $3 " " $4}' filename

Then you can use join normally on "field 1"
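For instance, a minimal sketch (the intermediate filenames f1.keys and f2.keys are my own; join requires its inputs sorted on the join field):

```shell
# Merge fields 1-3 into one underscore-joined key, sort each file on that
# key, then join on it (field 1 of each intermediate file).
awk '{print $1 "_" $2 "_" $3, $4}' file1 | sort > f1.keys
awk '{print $1 "_" $2 "_" $3, $4}' file2 | sort > f2.keys
join f1.keys f2.keys    # e.g. "bar_2_scaf 3.3 1.00"
```

Each output line is a common key followed by the fourth field from file1 and then the fourth field from file2.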

Michael Mrozek
Useless use of cat.
Dennis Williamson
Oh yeah, I forgot awk takes a filename. Fixed
Michael Mrozek
+1  A: 

Here is a way to do it in Perl:

#!/usr/local/bin/perl
use warnings;
use strict;
open my $file1, "<", "file1" or die $!;
my %file1keys;
while (<$file1>) {
    my @keys = split /\s+/, $_;
    next unless @keys;
    $file1keys{$keys[0]}{$keys[1]}{$keys[2]} = [$., $_];
}
close $file1 or die $!;
open my $file2, "<", "file2" or die $!;
while (<$file2>) {
    my @keys = split /\s+/, $_;
    next unless @keys;
    if (my $found = $file1keys{$keys[0]}{$keys[1]}{$keys[2]}) {
        print "Keys occur at file1:$found->[0] and file2:$..\n";
    }
}
close $file2 or die $!;
Kinopiko
+1  A: 

you can try this

awk '{
 o1=$1;o2=$2;o3=$3
 $1=$2=$3="";gsub(" +","")
 _[o1 FS o2 FS o3]=_[o1 FS o2 FS o3] FS $0
}
END{ for(i in _) print i,_[i] }' file1 file2

output

$ ./shell.sh
foo 1 scaf  3 4.5
bar 2 scaf  3.3 1.00
foo 1 boo  2.3

If you want to omit uncommon lines

awk 'FNR==NR{
 s=""
 for(i=4;i<=NF;i++){ s=s FS $i }
 _[$1$2$3] = s
 next
}
{
  printf $1 FS $2 FS $3 FS
  for(o=4;o<NF;o++){
   printf $o" "
  }
  printf $NF FS _[$1$2$3]"\n"
 } ' file2 file1

output

$ ./shell.sh
foo 1 scaf 3  4.5
bar 2 scaf 3.3  1.00
ghostdog74
+2  A: 
Jonathan Leffler
it doesn't work using GNU join.
ghostdog74
@ghostdog74: yeah - see the rewritten answer. It spent some time deleted while I resolved the issues (and there was a period before you added your comment while it was deleted too; it's been deleted twice).
Jonathan Leffler
A: 

A professor I used to work with created a set of Perl scripts that can perform many database-like operations on column-oriented flat text files. It's called Fsdb. It can definitely do this, and it's especially worth looking into if this isn't just a one-off need (so you're not constantly writing custom scripts).

Tyler McHenry