This shell script extracts a line of data from $2 if it contains the pattern $line.

$line is constructed using the regular expression [A-Z0-9.-]+@[A-Z0-9.-]+ (a simple email match), from the lines in file $1.

#! /bin/sh

clear

for line in `cat "$1" | grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+"`
do
    echo `cat "$2" | grep -m 1 "\b$line\b"`
done

File $1 has short lines of data (< 100 chars) and contains approx. 50k lines (approx. 1-1.5 MB).

File $2 has slightly longer lines of text (80 to 200 chars) and has 2M+ lines (approx. 200 MB).

The desktops this is running on have plenty of RAM (6 GB) and Xeon processors with 2-4 cores.

Are there any quick fixes to increase performance? Currently it takes 1-2 hours to run completely (and output the results to another file).

NB: I'm open to all suggestions, but we're not in a position to completely rewrite the whole system. In addition, the data comes from a third party and is prone to random formatting.

+1  A: 

If $1 is a file, don't use "cat | grep". Instead, pass the file directly to grep. It should look like:

grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" $1

Besides, you may want to adjust your regex. You should at least expect the underscore ("_") in an email address, so

grep -i -o -E "[A-Z0-9._-]+@[A-Z0-9.-]+" $1
Jan
I don't think you can have _ in domain names, so /[A-Z0-9._-]+@[A-Z0-9.-]+/ would be best.
rjstelling
You're right. I removed the underscore from the domain part.
Jan
pattern and filename should be switched, says my `man grep`.
Boldewyn
Ok, updated the answer.
Jan
+6  A: 
John Kugelman
A little bit of a performance improvement would be to omit the while read loop entirely, because if, say, 1000 email addresses are found, the script will call grep on file2 1000 times:

grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" file1 > temp
grep -f temp file2
ghostdog74
Using your final solution, grep gave an "out of memory" error.
rjstelling
+1  A: 

The problem is that you are piping too many shell commands, as well as making unnecessary use of cat.

One possible solution using just awk:

awk 'FNR==NR{
    # collect all email addresses from file1
    for(i=1;i<=NF;i++){
        if ( $i ~ /[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+/){
            email[$i]
        }
    }
    next
}
{
    # print each line of file2 that matches any stored address
    for(i in email) {
        if ($0 ~ i) {
            print
        }
    }
}' file1 file2
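
A usage note: to write the matches to another file, as the question mentions, the program body above can be saved to its own file (say, a hypothetical extract.awk, without the surrounding quotes and file names) and the output redirected; results.txt is a placeholder:

awk -f extract.awk file1 file2 > results.txt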
ghostdog74
+1  A: 

I would take the loop out, since grepping a 2 million line file 50k times is probably pretty expensive ;)

To take the loop out, first create a file of all your email addresses with your outer grep command. Then use this as a pattern file for your second grep by using grep -f, as sketched below.
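
A minimal sketch of those two steps, assuming the same positional arguments as the original script (the addresses.tmp name is a placeholder):

grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1" > addresses.tmp
grep -f addresses.tmp "$2"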

Brian
+1  A: 

As John Kugelman has already answered, process the grep output by piping it rather than using backticks. If you are using backticks, the whole expression within the backticks will be run first, and then the outer expression will be run with the output from the backticks as arguments.

First of all, this will be a lot slower than necessary, as piping would allow the two programs to run simultaneously (which is really good if they are both CPU intensive and you have multiple CPUs). However, there is another very important aspect to this: the line

for line in `cat "$1" | grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+"`

may become too long for the shell to handle. Most shells (to my knowledge, at least) limit the length of a command line, or at least the arguments to a command, and I think this could become a problem for the for loop too.
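
A piped rewrite of that loop, keeping the original pattern and arguments (a sketch, not a tested drop-in replacement):

grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1" |
while read line
do
    grep -m 1 "\b$line\b" "$2"
done

This streams the addresses through the pipe one at a time instead of building one huge argument list, so the length limit never comes into play.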

Mikael Auno