I'm processing a huge file with (GNU) awk. The other available tools are the standard Linux shell tools and an old (but >5.0) version of Perl; I can't install modules.

My problem: whenever field1, field2, and field3 of a line contain X, Y, and Z, I must search another directory for a file that contains field4 and field5 together on one line, and insert some data from the found file into the current output.

E.g.:

Actual file line:

f1 f2 f3 f4 f5
X  Y  Z  A  B

Now I need to find another file (in another directory) that contains e.g.

f1 f2 f3 f4
A  U  B  W

And write to STDOUT $0 from the original file, and f2 and f3 from the found file, then process the next line of the original file.

Is it possible to do it with awk?

Thanks in advance.

A: 

This seems to work for some test files I set up matching your examples. Involving perl in this manner (interposed with grep) is probably going to hurt the performance a great deal, though...

## perl code to do some dirty work

# pull the matching lines out of the big file in a single grep call
for my $line (`grep 'X Y Z' myhugefile`) {
    chomp $line;
    # split ' ' copes with runs of whitespace, unlike split / /
    my ($a, $b, $c, $d, $e) = split ' ', $line;
    # look for lines in the other file that have field4 ... field5
    my $cmd = 'grep -P "' . $d . '\s+.+?\s+' . $e . '" otherfile';
    for my $from_otherfile (`$cmd`) {
        chomp $from_otherfile;
        my ($oa, $ob, $oc, $od) = split ' ', $from_otherfile;
        # print the first field of the original line, then f2 and f3
        # of the line found in the other file
        print "$a $ob $oc\n";
    }
}

EDIT: Use tsee's solution (below); it's much better thought out.

Adam Bellaire
I'll try it out on Monday, thanks.
Zsolt Botykai
Involving perl doesn't hurt performance at all! Calling shell commands via backticks from perl (as you do) is what ruins performance. If you use the shell-typical idiom of piping things through lots of programs or spawning many extra processes, you're going to send performance down the john.
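
For example, the per-line grep can be done in-process instead; a rough sketch, assuming $d and $e hold field4 and field5 as in the snippet above, with 'otherfile' standing in for the real file:

# open the file once and match with perl's own regex engine,
# instead of forking a shell plus a grep process per input line
open my $fh, '<', 'otherfile' or die "Can't open otherfile: $!";
my @hits = grep { /\Q$d\E\s+.+?\s+\Q$e\E/ } <$fh>;
close $fh;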
tsee
You're quite correct, tsee. I meant that involving perl in this particular way was going to hurt performance. Your properly written script is much better.
Adam Bellaire
Sorry for being so blunt, I misunderstood your post's second sentence. Cheers,
tsee
+2  A: 

Let me start out by saying that your problem description isn't really that helpful. Next time, please be more specific: you might be missing out on much better solutions.

So from your description, I understand you have two files which contain whitespace-separated data. In the first file, you want to match the first three columns against some search pattern. If found, you want to find all lines in another file which contain the fourth and fifth columns of the matching line from the first file. From those lines, you need to extract the second and third columns, and then print the first column of the first file and the second and third from the second file. Okay, here goes:

#!/usr/bin/perl -nwa
use strict;
use File::Find 'find';
our @F;  # filled by the -a autosplit; declared so it passes strict
my @search = qw(X Y Z);

# if you know in advance that the otherfile isn't
# huge, you can cache it in memory as an optimization.

# with any more columns, you want a loop here:
if ($F[0] eq $search[0]
    and $F[1] eq $search[1]
    and $F[2] eq $search[2])
{
  my @files;
  find(sub {
      return if not -f $_;
      # verbatim search for the columns in the file name.
      # I'm still not sure what your file-search criteria are, though.
      push @files, $File::Find::name if /\Q$F[3]\E/ and /\Q$F[4]\E/;
      # alternatively search for the combination:
      #push @files, $File::Find::name if /\Q$F[3]\E.*\Q$F[4]\E/;
      # or search *all* files in the search path?
      #push @files, $File::Find::name;
    }, '/search/path'
  );
  foreach my $file (@files) {
    open my $fh, '<', $file or die "Can't open file '$file': $!";
    while (defined($_ = <$fh>)) {
      chomp;
      # order of fields doesn't matter per your requirement.
      my @cols = split ' ', $_;
      my %seen = map {($_=>1)} @cols;
      if ($seen{$F[3]} and $seen{$F[4]}) {
        print join(' ', $F[0], @cols[1,2]), "\n";
      }
    }
    close $fh;
  }
} # end if matching line

Unlike another poster's solution, which makes lots of system calls, this doesn't fall back to the shell at all, so it should be plenty fast.
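
As the comment near the top of the script hints, if the matched files are known to be small, you can cache their contents instead of re-reading a file every time it turns up. A minimal sketch of that variant (@files and @F are as in the script above; everything else is illustrative):

our %cache;  # a package variable, so it survives each pass of the -n loop
for my $file (@files) {
  if (not exists $cache{$file}) {
    open my $fh, '<', $file or die "Can't open file '$file': $!";
    chomp(my @lines = <$fh>);
    close $fh;
    $cache{$file} = \@lines;
  }
  for my $line (@{ $cache{$file} }) {
    my @cols = split ' ', $line;
    my %seen = map { ($_ => 1) } @cols;
    print join(' ', $F[0], @cols[1,2]), "\n"
      if $seen{$F[3]} and $seen{$F[4]};
  }
}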

tsee
Sorry about not specifying things correctly. I'll try your solution as well at work. One question: how do I handle the case where the name of the other file (t.txt in your answer) is unknown, so that I need to search for a file matching my criteria?
Zsolt Botykai
What are your criteria for the file name? What you should do is use File::Find. It's a module for recursively traversing directories, and it has shipped with perl since 5.0, so you can safely use it.
tsee
This is a much better solution than my hack, which would load the entire contents of both greps into memory and (probably) be painfully slow. It would be nice to see the addition of File::Find for a complete solution.
Adam Bellaire
+1  A: 

This is the type of work that got me to move from awk to perl in the first place. If you do have to accomplish this with awk, you may actually find it easier to create a shell script that generates awk script(s) to query and then update in separate steps.

(I've written such a beast for reading/updating Windows-INI-style files; it's ugly. I wish I could have used perl.)

Tanktalus
+1  A: 

I often see the restriction "I can't use any Perl modules", and when it's not a homework question, it's often just due to a lack of information. "Yes, even you can use CPAN" contains the instructions on how to install CPAN modules locally without having root privileges. Another alternative is just to take the source code of a CPAN module and paste it into your program.

None of this helps if there are other, unstated restrictions, like a lack of disk space, that prevent the installation of (too many) additional files.
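
For reference, a minimal sketch of the local-install route (the PREFIX, the paths, and Some::Module are only placeholders; the exact library subdirectory depends on the PREFIX and your perl version):

# one-time install, run in the module's unpacked source directory:
#   perl Makefile.PL PREFIX=~/perl5
#   make && make test && make install

# then, in your script, put the local path on @INC before loading:
use lib "$ENV{HOME}/perl5/lib/perl5";
use Some::Module;  # placeholder for the locally installed module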

Corion
You are almost right, except in the case of very rigorous sysadmins, in a very big bank, on a live system, where I just got a call for not properly logging what I did with a file (my .vimrc), with no internet connection to the machine, so I have to ask the admins to upload files...
Zsolt Botykai