How about:
cat file1 file2 |
    awk '{print $1" "$2" "$3}' |
    sort |
    uniq -c |
    grep -v '^ *1 ' |
    awk '{print $2" "$3" "$4}'
This assumes you're not too worried about the exact whitespace between fields (in other words, three tabs and a space are treated the same as a space and seven tabs). That's usually the case when you're talking about fields within a text file.
What it does is output both files, stripping off the last field (since you don't care about that one for the comparison). It then sorts the result so that identical lines are adjacent, then uniquifies them (replaces each group of adjacent identical lines with one copy and a count).
It then gets rid of all the lines that had a count of one (no duplicates) and prints each remaining line with the count stripped off. That gives you the "keys" of the duplicate lines, and you can then use another awk pass to locate those keys in the original files if you wish.
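To make that concrete, here's a tiny worked example with made-up file contents. Suppose file1 contains:
a b c 1
d e f 2
and file2 contains:
a b c 9
x y z 3
After the first awk strips the last field, the combined stream is "a b c", "d e f", "a b c" and "x y z". Sorting brings the two "a b c" lines together, uniq -c gives them a count of 2 (and the others a count of 1), the grep throws away the count-1 lines, and the final awk prints just:
a b c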
This won't work as expected if two identical keys are only in one file, since the files are combined early on. In other words, if you have duplicate keys in file1 but not in file2, that will be a false positive: two copies of an "a b c" key in file1 alone still produce a count of 2, even though file2 never mentions that key.
Then, the only real solution I can think of is one that checks file2 for each line in file1, although I'm sure others may come up with cleverer solutions.
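For what it's worth, one cleverer option along those lines is a single awk invocation that remembers the keys from file1 and then checks each line of file2 against them. This is only a sketch, assuming as before that the first three whitespace-separated fields form the key:
awk 'NR==FNR { seen[$1" "$2" "$3] = 1; next }      # first file: remember each key
     ($1" "$2" "$3) in seen { print $1, $2, $3 }   # second file: print keys found in both
    ' file1 file2
If file2 repeats a key, it gets printed once per occurrence; an extra !dup[$1" "$2" "$3]++ condition would print each key only once.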
And, for those who enjoy a little bit of sado-masochism, here's the aforementioned not-overly-efficient solution:
cat file1 |
    sed -e 's/ [^ ]*$/ "/' \
        -e 's/ / */g' \
        -e 's/^/grep "^/' \
        -e 's/$/ file2 | awk "{print \\$1\\" \\"\\$2\\" \\"\\$3}"/' \
        > xx99
bash xx99
rm xx99
This one constructs a separate script file to do the work. For each line in file1, it creates a line in the script that looks for that line's key in file2. If you want to see how it works, just have a look at xx99 before you delete it.
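To illustrate, a hypothetical file1 line like
a b c 42
would become this line in xx99:
grep "^a *b *c *" file2 | awk "{print \$1\" \"\$2\" \"\$3}"
The sed pass strips the trailing field ("42"), turns each space into the looser pattern " *", and wraps the result in a grep that searches file2, with the awk at the end printing just the three key fields of any match.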
And, in this one, the spaces do matter, so don't be surprised if it doesn't work for lines where the spacing differs between file1 and file2 (though, as with most "hideous" scripts, that can be fixed with just another link in the pipeline; see the sketch below). It's more here as an example of the ghastly things you can create for quick'n'dirty jobs.
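If you do want that extra link, one candidate is squeezing runs of spaces and tabs down to single spaces in both files up front; a rough sketch (the .norm names are just my own invention):
tr -s ' \t' ' ' < file1 > file1.norm
tr -s ' \t' ' ' < file2 > file2.norm
and then point the script above at file1.norm and file2.norm instead.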
This is not what I would do for production-quality code but it's fine for a once-off, provided you destroy all evidence of it before The Daily WTF finds out about it :-)