I'm looking for a way to grep for lines containing multiple atoms, e.g. both "foo" and "bar". I'm aware I can use
grep 'foo' file | grep 'bar'
to get lines containing both, but I was wondering if there is a more efficient way. Googling only seems to turn up results for an 'or'-based search rather than an 'and'-based one.

+2  A: 
egrep '(foo.*bar|bar.*foo)'
# or
grep -E '(foo.*bar|bar.*foo)'
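As a quick check that the alternation really covers both orders, here is a minimal sketch on made-up sample lines. The sample text and the function names (`sample`, `and_regex`, `and_pipeline`) are illustrative only and not part of the original answer; the second function is the pipeline from the question, reading from standard input.

```shell
# Illustrative sample input: only the first two lines contain both words.
sample() {
    printf '%s\n' 'foo then bar' 'bar then foo' 'only foo' 'only bar' 'neither'
}

# Single regex: foo somewhere before bar, or bar somewhere before foo.
and_regex() { grep -E '(foo.*bar|bar.*foo)'; }

# Two chained searches: keep lines with foo, then keep those that also have bar.
and_pipeline() { grep 'foo' | grep 'bar'; }

# Both commands should print the same two lines.
sample | and_regex
sample | and_pipeline
```

Both forms select the same lines here; they differ only in how much work each grep does per line.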
divideandconquer.se
This is a one-liner which does what the questioner wants, but I'm unconvinced that it's more efficient. The questioner's method doesn't involve backtracking in the regex and is likely to be faster, despite having two processes running. In fact, it is possibly much faster in a multi-processor environment.
paxdiablo
I don't see how multiple processors would make a difference; as far as I know (I might be making a bad assumption here), the two greps would not run in parallel.
Simon Rowsby
The speed benefit doesn't come from processes running in parallel (which they could, if the shell supports it), but from the fact that two literal text searches are far faster than a single search with a backtracking regex.
Jan Goyvaerts
In a multi-processor system, the processes can run truly in parallel, with the first grep continuing to look for foo in the file while the second looks for bar in its input stream. The overlap is small but significant. In the best case, where every line matches foo, the elapsed time would approach half.
paxdiablo
Except for the first and last line which are serialized.
paxdiablo
+1  A: 

This might be efficient. :) Loading grep again is probably free due to caching at the file system level. And, assuming the number of hits is small (in comparison to the number of lines of input), and most lines that contain 'foo' are going to be hits for 'bar' too, the second instance of grep doesn't have a lot to do.

unwind
I think the assumption that most lines with foo also have bar is wrong, unless the questioner means literal foo/bar rather than some other strings. Your point about fewer lines being given to the second grep is valid. Most greps produce sparse results (e.g. 1% of input), so the second grep would have far less to do.
paxdiablo
+4  A: 

I doubt you'll get a more efficient way than the one you've selected. Given that the grep executable will already be mapped into memory when the second copy runs, and that you have no backtracking in your regex (unlike the obvious egrep 'foo.*bar|bar.*foo' solution), I think what you have is as fast as you're going to get.

Here are some sample timings to illustrate the point:

allan@allan-desktop:~$ time egrep 'foo.*bar|bar.*foo' foobar | wc -l
2000
real 0m0.006s
user 0m0.004s
sys  0m0.004s

allan@allan-desktop:~$ time fgrep 'foo' foobar | fgrep 'bar' | wc -l
2000
real 0m0.039s
user 0m0.000s
sys  0m0.000s

allan@allan-desktop:~$ time egrep 'foo.*bar|bar.*foo' foobar | wc -l
2000
real 0m0.006s
user 0m0.004s
sys  0m0.008s

allan@allan-desktop:~$ time fgrep 'foo' foobar | fgrep 'bar' | wc -l
2000
real 0m0.005s
user 0m0.004s
sys  0m0.004s

From this admittedly small sample, the pipeline version takes less system and user CPU time, and hence is more efficient.

The input file consists of 1000 copies of:

foo-bar
bar-dgfjhdgjhdgdfgdjghdjghdfg-foo

so you can run your own tests.
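The input file described above can be regenerated with something like the following sketch. The function name `make_foobar` and its arguments are hypothetical; the two line patterns and the default file name `foobar` are taken from the post.

```shell
# Hypothetical helper: write COUNT copies of the two sample lines to OUT.
# Defaults give the 2000-line file used in the timings (1000 copies, file "foobar").
make_foobar() {
    count=${1:-1000}
    out=${2:-foobar}
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s\n' 'foo-bar' 'bar-dgfjhdgjhdgdfgdjghdjghdfg-foo'
        i=$((i + 1))
    done > "$out"
}
```

Calling `make_foobar 1000 foobar` produces the 2000-line file; a larger count gives a larger input for repeating the comparison.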

Here's the same test with 100,000 lines of input - you can see the questioner's method is more efficient:

allan@allan-desktop:~$ time egrep 'foo.*bar|bar.*foo' foobar | wc -l
100000
real 0m0.135s
user 0m0.136s
sys  0m0.012s

allan@allan-desktop:~$ time fgrep 'foo' foobar | fgrep 'bar' | wc -l
100000
real 0m0.034s
user 0m0.048s
sys  0m0.012s

allan@allan-desktop:~$ time egrep 'foo.*bar|bar.*foo' foobar | wc -l
100000
real 0m0.151s
user 0m0.144s
sys  0m0.000s

allan@allan-desktop:~$ time fgrep 'foo' foobar | fgrep 'bar' | wc -l
100000
real 0m0.046s
user 0m0.044s
sys  0m0.012s
paxdiablo