This shell script extracts a line of data from $2 if it contains the pattern $line.

$line is constructed using the regular expression [A-Z0-9.-]+@[A-Z0-9.-]+ (a simple email match), from the lines in file $1.

#! /bin/sh

clear

for line in `cat "$1" | grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+"`
do
    echo `cat "$2" | grep -m 1 "\b$line\b"`
done

File $1 has short lines of data (< 100 chars) and contains approx. 50k lines (approx. 1-1.5 MB).

File $2 has slightly longer lines of text (80 to 200 chars) and has 2M+ lines (approx. 200 MB).

The desktops this is running on have plenty of RAM (6 GB) and Xeon processors with 2-4 cores.

Are there any quick fixes to increase performance? Currently it takes 1-2 hours to run completely (and output the results to another file).

NB: I'm open to all suggestions, but we're not in a position to completely rewrite the whole system. In addition, the data comes from a third party and is prone to random formatting.

+1  A: 

If $1 is a file, don't use "cat | grep". Instead, pass the file directly to grep. It should look like:

grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" $1

Besides, you may want to adjust your regex. You should at least expect the underscore ("_") in an email address, so

grep -i -o -E "[A-Z0-9._-]+@[A-Z0-9.-]+" $1
Jan
I don't think you can have _ in domain names, so /[A-Z0-9._-]+@[A-Z0-9.-]+/ would be best.
rjstelling
You're right. I removed the underscore from the domain part.
Jan
pattern and filename should be switched, says my `man grep`.
Boldewyn
Ok, updated the answer.
Jan
+6  A: 
John Kugelman
A little bit of a performance improvement would be to omit the while read loop entirely, because if, say, 1000 email addresses are found, the script will call grep on file2 1000 times:

grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" file1 > temp
grep -f temp file2
ghostdog74
Using your final solution, grep gave an "out of memory" error.
rjstelling
+1  A: 

The problem is that you are piping too many shell commands, as well as making unnecessary use of cat.

One possible solution using just awk:

awk 'FNR==NR{
    # collect all email addresses from file1
    for(i=1;i<=NF;i++){
        if ( $i ~ /[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+/){
            email[$i]
        }
    }
    next
}
{
    # print each line of file2 that matches any stored address
    for(i in email) {
        if ($0 ~ i) {
            print
        }
    }
}' file1 file2
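
A usage note: to write the matches to another file, as the question mentions, the program body above can be saved to its own file (say, a hypothetical extract.awk, without the surrounding quotes and file names) and the output redirected; results.txt is a placeholder:

awk -f extract.awk file1 file2 > results.txt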
ghostdog74
+1  A: 

I would take the loop out, since grepping a 2 million line file 50k times is probably pretty expensive ;)

To take the loop out, first create a file of all your email addresses with your outer grep command. Then use this as a pattern file for your second grep by using grep -f, as sketched below.
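
A minimal sketch of those two steps, assuming the same positional arguments as the original script (the addresses.tmp name is a placeholder):

grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1" > addresses.tmp
grep -f addresses.tmp "$2"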

Brian
+1  A: 

As John Kugelman has already answered, process the grep output by piping it rather than using backticks. If you are using backticks, the whole expression within the backticks will be run first, and then the outer expression will be run with the output from the backticks as arguments.

First of all, this will be a lot slower than necessary, as piping would allow the two programs to run simultaneously (which is really good if they are both CPU intensive and you have multiple CPUs). However, there is another very important aspect to this: the line

for line in `cat "$1" | grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+"`

may become too long for the shell to handle. Most shells (to my knowledge, at least) limit the length of a command line, or at least the arguments to a command, and I think this could become a problem for the for loop too.
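
A piped rewrite of that loop, keeping the original pattern and arguments (a sketch, not a tested drop-in replacement):

grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1" |
while read line
do
    grep -m 1 "\b$line\b" "$2"
done

This streams the addresses through the pipe one at a time instead of building one huge argument list, so the length limit never comes into play.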

Mikael Auno