ansaurus

Question

Efficient grep method to match two atoms?

Answer 1

+2 A:

egrep '(foo.*bar|bar.*foo)'
# or
grep -E '(foo.*bar|bar.*foo)'
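
As a rough sanity check, here is a small sketch comparing the single-regex form with the two-grep pipeline timed in another answer. The sample lines and variable names are made up for illustration, and a POSIX shell with a standard grep is assumed.

```shell
# Hypothetical sample input: only the first two lines contain both atoms.
sample=$(printf '%s\n' 'foo then bar' 'bar then foo' 'only foo' 'only bar' 'neither')

# Single process: one alternation covering both orderings.
one_pass=$(printf '%s\n' "$sample" | grep -E '(foo.*bar|bar.*foo)')

# Two processes: one literal search each (-F treats the pattern as a fixed string).
two_pass=$(printf '%s\n' "$sample" | grep -F 'foo' | grep -F 'bar')

# Print the lines selected by the single-regex form.
printf '%s\n' "$one_pass"
```

Both forms select the same two lines here. They are only equivalent when the atoms cannot overlap: for atoms such as ab and bc, the line abc passes the pipeline but not the alternation.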

divideandconquer.se 2008-12-03 12:08:14

This is a one-liner that does what the questioner wants, but I'm unconvinced that it's more efficient: the questioner's method doesn't involve regex backtracking and is likely to be faster, despite having two processes running. In fact, it is possibly much faster in a multi-processor environment.

paxdiablo 2008-12-03 12:18:02

I don't see how multiple processors would make a difference; as far as I know (I might be making a bad assumption here), they would not run in parallel.

Simon Rowsby 2008-12-03 12:24:04

The speed benefit doesn't come from processes running in parallel (which they could, if the shell supports it), but from the fact that two literal text searches are far faster than a single search with a backtracking regex.

Jan Goyvaerts 2008-12-03 12:31:40

In a multi-processor system, the processes can run truly in parallel, with the first grep continuing to look for foo in the file while the second looks for bar in its input stream. The overlap is small but significant. In the best case, where every line matches foo, the total time would approach half.

paxdiablo 2008-12-03 12:34:25

Except for the first and last lines, which are serialized.

paxdiablo 2008-12-03 12:34:57

Answer 2

+1 A:

This might be efficient. :) Loading grep again is probably free due to caching at the file system level. And, assuming the number of hits is small (in comparison to the number of lines of input) and most lines that contain 'foo' are going to be hits for 'bar' too, the second instance of grep doesn't have a lot to do.

unwind 2008-12-03 12:09:24

I think the assumption that most lines with foo also have bar is wrong, unless the questioner means literal foo/bar rather than some other strings. Your point about fewer lines being given to the second grep is valid. Most greps produce sparse results (e.g. 1% of input), so the second grep would have far less to do.

paxdiablo 2008-12-03 12:42:11

Answer 3

+4 A:

I doubt you'll get a more efficient way than the one you've selected. Given that the grep executable will already be mapped into memory when the second copy runs, and that you have no backtracking in your regex (unlike the obvious egrep 'foo.*bar|bar.*foo' solution), I think what you have is as fast as you're going to get.

Here's some sample timings to illustrate the point:

allan@allan-desktop:~$ time egrep 'foo.*bar|bar.*foo' foobar | wc -l
2000
real 0m0.006s
user 0m0.004s
sys  0m0.004s

allan@allan-desktop:~$ time fgrep 'foo' foobar | fgrep 'bar' | wc -l
2000
real 0m0.039s
user 0m0.000s
sys  0m0.000s

allan@allan-desktop:~$ time egrep 'foo.*bar|bar.*foo' foobar | wc -l
2000
real 0m0.006s
user 0m0.004s
sys  0m0.008s

allan@allan-desktop:~$ time fgrep 'foo' foobar | fgrep 'bar' | wc -l
2000
real 0m0.005s
user 0m0.004s
sys  0m0.004s

From this admittedly small sample, the pipeline version takes less system and user CPU time, and is therefore more efficient.

The input file consists of 1000 copies of:

foo-bar
bar-dgfjhdgjhdgdfgdjghdjghdfg-foo

so you can run your own tests.
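
A minimal sketch for recreating that input file, assuming a POSIX shell. The file name foobar matches the timings above. The variable names and the `copies` knob are only illustrative. It also checks that both approaches report the same match count before you time anything.

```shell
#!/bin/sh
# Number of copies of the two sample lines.
# 1000 copies gives the 2000-line file; 50000 copies would give 100,000 lines.
copies=1000

# Build the test input in one redirect so the file is opened only once.
i=0
while [ "$i" -lt "$copies" ]; do
    printf '%s\n' 'foo-bar' 'bar-dgfjhdgjhdgdfgdjghdjghdfg-foo'
    i=$((i + 1))
done > foobar

# Sanity check: both approaches should count the same matching lines.
# tr strips the padding that some wc implementations add.
regex_count=$(grep -E 'foo.*bar|bar.*foo' foobar | wc -l | tr -d ' ')
pipe_count=$(grep -F 'foo' foobar | grep -F 'bar' | wc -l | tr -d ' ')
echo "regex: $regex_count, pipeline: $pipe_count"
```

With the file in place, the `time` commands shown above can be rerun as-is. Expect the absolute numbers to differ by machine and grep version.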

Here's the same test with 100,000 lines of input; you can see the questioner's method is more efficient:

allan@allan-desktop:~$ time egrep 'foo.*bar|bar.*foo' foobar | wc -l
100000
real 0m0.135s
user 0m0.136s
sys  0m0.012s

allan@allan-desktop:~$ time fgrep 'foo' foobar | fgrep 'bar' | wc -l
100000
real 0m0.034s
user 0m0.048s
sys  0m0.012s

allan@allan-desktop:~$ time egrep 'foo.*bar|bar.*foo' foobar | wc -l
100000
real 0m0.151s
user 0m0.144s
sys  0m0.000s

allan@allan-desktop:~$ time fgrep 'foo' foobar | fgrep 'bar' | wc -l
100000
real 0m0.046s
user 0m0.044s
sys  0m0.012s

paxdiablo 2008-12-03 12:21:46
