I'm looking for a method to grep for multiple atoms e.g. "foo" and "bar".
I'm aware I can use
grep 'foo' file | grep 'bar'
to get the lines that contain both of them, but I was wondering if there is a more efficient way. Googling only seems to turn up results for an 'or' based search rather than an 'and'.
egrep '(foo.*bar|bar.*foo)'
# or
grep -E '(foo.*bar|bar.*foo)'
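As a quick sanity check that the single regex selects the same lines as the two-grep pipeline from the question, here is a minimal sketch. The sample file contents and the variable names are made up for illustration and are not from the original posts.

```shell
# Build a small throwaway sample file (contents invented for this demo).
sample=$(mktemp)
printf '%s\n' 'foo then bar' 'bar then foo' 'only foo' 'only bar' 'neither' > "$sample"

# One process: match the two words in either order on the same line.
regex_hits=$(grep -E '(foo.*bar|bar.*foo)' "$sample")

# Two processes: lines containing foo, narrowed to those that also contain bar.
pipe_hits=$(grep 'foo' "$sample" | grep 'bar')

# Print what each approach selected so the two can be compared by eye.
printf 'regex:\n%s\n' "$regex_hits"
printf 'pipeline:\n%s\n' "$pipe_hits"

rm -f "$sample"
```

Both approaches should select only the first two lines of the sample, since those are the only ones that contain both words.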
Your pipeline might already be efficient. :) Loading grep again is probably free due to caching at the file system level. And, assuming the number of hits is small (in comparison to the number of lines of input), and most lines that contain 'foo' are going to be hits for 'bar' too, the second instance of grep doesn't have a lot to do.
I doubt you'll get a more efficient way than the one you've selected. Given that the grep executable will already be mapped into memory when the second copy runs, and that you have no backtracking in your regex (unlike the obvious egrep 'foo.*bar|bar.*foo' solution), I think what you have is as fast as you're going to get.
Here are some sample timings to illustrate the point:
allan@allan-desktop:~$ time egrep 'foo.*bar|bar.*foo' foobar | wc -l
2000
real 0m0.006s
user 0m0.004s
sys 0m0.004s
allan@allan-desktop:~$ time fgrep 'foo' foobar | fgrep 'bar' | wc -l
2000
real 0m0.039s
user 0m0.000s
sys 0m0.000s
allan@allan-desktop:~$ time egrep 'foo.*bar|bar.*foo' foobar | wc -l
2000
real 0m0.006s
user 0m0.004s
sys 0m0.008s
allan@allan-desktop:~$ time fgrep 'foo' foobar | fgrep 'bar' | wc -l
2000
real 0m0.005s
user 0m0.004s
sys 0m0.004s
From this admittedly small sample, the pipeline version takes less system and user CPU time, hence is more efficient.
The input file consists of 1000 copies of:
foo-bar
bar-dgfjhdgjhdgdfgdjghdjghdfg-foo
so you can run your own tests.
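If you'd rather not build that file by hand, something like the following should recreate it. This is only a sketch: make_foobar, workdir and the count variables are hypothetical names of my own, and I've used grep -E and grep -F, the current spellings of egrep and fgrep.

```shell
# Hypothetical helper: write N copies of the two-line block to a file.
# Usage: make_foobar <copies> <output-file>
make_foobar() {
    awk -v copies="$1" 'BEGIN {
        for (i = 0; i < copies; i++) {
            print "foo-bar"
            print "bar-dgfjhdgjhdgdfgdjghdjghdfg-foo"
        }
    }' > "$2"
}

workdir=$(mktemp -d)
make_foobar 1000 "$workdir/foobar"

# Count matching lines with both approaches.
# tr strips the padding that some wc implementations put around the number.
regex_count=$(grep -E 'foo.*bar|bar.*foo' "$workdir/foobar" | wc -l | tr -d ' ')
pipe_count=$(grep -F 'foo' "$workdir/foobar" | grep -F 'bar' | wc -l | tr -d ' ')
total_lines=$(wc -l < "$workdir/foobar" | tr -d ' ')

echo "lines: $total_lines  regex: $regex_count  pipeline: $pipe_count"

rm -rf "$workdir"
```

With 1000 copies every line contains both words, so both counts should come out as 2000, matching the output above. Passing 50000 instead of 1000 gives a 100,000-line file.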
Here's the same test with 100,000 lines of input. You can see the questioner's method is more efficient:
allan@allan-desktop:~$ time egrep 'foo.*bar|bar.*foo' foobar | wc -l
100000
real 0m0.135s
user 0m0.136s
sys 0m0.012s
allan@allan-desktop:~$ time fgrep 'foo' foobar | fgrep 'bar' | wc -l
100000
real 0m0.034s
user 0m0.048s
sys 0m0.012s
allan@allan-desktop:~$ time egrep 'foo.*bar|bar.*foo' foobar | wc -l
100000
real 0m0.151s
user 0m0.144s
sys 0m0.000s
allan@allan-desktop:~$ time fgrep 'foo' foobar | fgrep 'bar' | wc -l
100000
real 0m0.046s
user 0m0.044s
sys 0m0.012s