Hi all,

I'm seeing strange behaviour from egrep -f.

Example:

$ egrep -f ~/tmp/tmpgrep2 orig_20_L_A_20090228.txt | wc -l
3
$ for lines in `cat ~/tmp/tmpgrep2` ; do  egrep $lines orig_20_L_A_20090228.txt ; done | wc -l
12

Could someone give me a hint as to what the problem could be? No, the files did not change between executions. The expected egrep line count is 12.

UPDATE on file contents: the searched file contains about 13,000 lines, each 500 characters long; the pattern file contains 12 lines, each 24 characters long. The pattern always (and only) occurs at a fixed position in the searched file (characters 26-49).

UPDATE on pattern contents: every pattern in tmpgrep2 is a 24-character-long number.

+1  A: 

Could it be that the lines read contain something that the shell is expanding or substituting for you in the second version? grep wouldn't do that when it reads the patterns itself, which would lead to a different set of patterns being matched.

I'm not totally sure if the shell is doing any expansion on the variable value in an invocation like that, but it's an idea at least.

EDIT: Nope, it doesn't seem to do any substitutions. But it could be a quoting issue: if your patterns contain whitespace, the for loop will step through each token, not through each line. Take a look at the read bash builtin.
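To illustrate the token-versus-line point, here is a minimal sketch. The file names and contents are made up for the demo (they are not the asker's files), and grep -E is used as the equivalent of egrep.

```shell
# Hypothetical sample data, created only for this sketch.
tmpdir=$(mktemp -d)
printf '%s\n' 'foo bar' 'baz' > "$tmpdir/patterns"
printf '%s\n' 'xx foo bar xx' 'yy baz yy' 'zz foo zz' > "$tmpdir/data"

# Unquoted `cat` output is word-split, so the loop sees three tokens:
# foo, bar and baz.
for_count=$(for p in `cat "$tmpdir/patterns"`; do
    grep -E -- "$p" "$tmpdir/data"
done | wc -l)

# read hands over one whole line at a time, so 'foo bar' stays one pattern.
read_count=$(while IFS= read -r p; do
    grep -E -- "$p" "$tmpdir/data"
done < "$tmpdir/patterns" | wc -l)

echo "for loop:" $for_count "while read:" $read_count
rm -rf "$tmpdir"
```

With this sample data the for loop reports 4 lines and the while read loop reports 2, because the first pattern is searched as two separate words in the for version.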

unwind
It could be a cause (+1), but not in this case; see my update on the patterns.
Zsolt Botykai
A: 

I second @unwind.

Why don't you run without wc -l and see what each search is finding?

And maybe:

for lines in `cat ~/tmp/tmpgrep2` ; do echo $lines ; done

Just to see how the shell is handling $lines?

Douglas Leeder
I did so. The wc -l was just added to show that it acts strangely.
Zsolt Botykai
+2  A: 

If the search patterns are found on the same lines, then you can get the result you see:

Suppose you look for:

abc
def
ghi
jkl

and the data file is:

abcdefghijklmnoprstuvwxzy

then the one-time command will print 1 and the loop will print 4.
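A quick way to reproduce this, as a sketch with throwaway files (the temporary paths are just for the demo; grep -E stands in for egrep):

```shell
# Recreate the four patterns and the one-line data file from the example.
tmpdir=$(mktemp -d)
printf '%s\n' abc def ghi jkl > "$tmpdir/patterns"
printf '%s\n' abcdefghijklmnoprstuvwxzy > "$tmpdir/data"

# One pass with -f: a matching line is printed once, however many
# patterns match it.
once=$(grep -E -f "$tmpdir/patterns" "$tmpdir/data" | wc -l)

# One grep per pattern: the same line is printed once per matching pattern.
loop=$(for p in `cat "$tmpdir/patterns"`; do
    grep -E -- "$p" "$tmpdir/data"
done | wc -l)

echo "one-time:" $once "loop:" $loop
rm -rf "$tmpdir"
```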

Jonathan Leffler
Possible cause of the problem (+1 vote), but that's not the case here. See my update on where the pattern can occur.
Zsolt Botykai
Then 'tis time to get wc out of the system and look at the results of egrep in the raw form. You might also want to use the '-n' option to report line numbers. Since you only get a dozen lines of output, it isn't going to be too bad. You can perhaps use 'cut -c1-70' so the long lines are shorter.
Jonathan Leffler
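A rough sketch of that raw comparison, using small made-up files in place of the real ones (the names and contents here are assumptions for the demo only). Both result sets are numbered with -n, trimmed with cut, sorted and deduplicated, then diffed.

```shell
# Hypothetical stand-ins for the pattern file and the searched file.
tmpdir=$(mktemp -d)
printf '%s\n' 111 222 > "$tmpdir/patterns"
printf '%s\n' 'aaa 111 bbb' 'ccc 333 ddd' 'eee 222 111' > "$tmpdir/data"

# Single pass with -f, with line numbers, long lines cut to 70 characters.
grep -E -n -f "$tmpdir/patterns" "$tmpdir/data" \
    | cut -c1-70 | sort -u > "$tmpdir/once.txt"

# One grep per pattern; sort -u folds lines matched by several patterns.
for p in `cat "$tmpdir/patterns"`; do
    grep -E -n -- "$p" "$tmpdir/data"
done | cut -c1-70 | sort -u > "$tmpdir/loop.txt"

# Empty diff output means both approaches found the same set of lines.
difference=$(diff "$tmpdir/once.txt" "$tmpdir/loop.txt")
echo "diff output: [$difference]"
rm -rf "$tmpdir"
```

On the real files, any line that shows up in only one of the two lists is the place to start looking.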
Are there regular expression metacharacters in the data you're trying to match? That could confuse things, too. But you are now, probably, left with manual analysis of the two separate sets of results. FWIW: the last thing to think of is 'bug in egrep'; that is most unlikely.
Jonathan Leffler
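One way to rule metacharacters in or out is to compare a regex search with a fixed-string search (-F). A minimal sketch with invented sample data, where the dot in the pattern is the only metacharacter:

```shell
# Hypothetical data: the pattern contains a dot, which is a regex wildcard.
tmpdir=$(mktemp -d)
printf '%s\n' '1.5' > "$tmpdir/patterns"
printf '%s\n' 'value 1.5' 'value 125' > "$tmpdir/data"

# As an extended regex, '1.5' also matches '125'.
regex_count=$(grep -E -c -f "$tmpdir/patterns" "$tmpdir/data")

# As a fixed string, only the literal '1.5' matches.
fixed_count=$(grep -F -c -f "$tmpdir/patterns" "$tmpdir/data")

echo "regex: $regex_count, fixed: $fixed_count"
rm -rf "$tmpdir"
```

If the two counts agree on the real files, metacharacters in the patterns are probably not the explanation.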
+1  A: 

Do you have any duplicates in ~/tmp/tmpgrep2? Egrep will only use the dupes one time, but your loop will use each occurrence.

Get rid of dupes by doing something like this:

$ for lines in `sort < ~/tmp/tmpgrep2 | uniq` ; do  egrep $lines orig_20_L_A_20090228.txt ; done | wc -l
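To check for duplicates first, something like the following sketch should do. The pattern file here is a made-up example; uniq -d prints only the lines that occur more than once in sorted input.

```shell
# Hypothetical pattern file containing one repeated entry.
tmpdir=$(mktemp -d)
printf '%s\n' 123 456 123 > "$tmpdir/patterns"

# List the patterns that appear more than once; empty output means no dupes.
dupes=$(sort "$tmpdir/patterns" | uniq -d)

echo "duplicated patterns: $dupes"
rm -rf "$tmpdir"
```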
bstpierre
+1 as this can be the cause, but not in my case. There are no duplicates in my pattern file.
Zsolt Botykai
A: 

The others have already come up with most of the things I would look at. The next thing I would check is the environment variable GREP_OPTIONS, or whatever it is called on your machine. I've gotten the strangest error messages or behaviors when using a command line argument that interfered with the environment settings.
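A small sketch of that check. It assumes GNU grep, where GREP_OPTIONS is the variable in question (recent GNU grep releases have deprecated it); the test input is made up.

```shell
# Show the current value, or "unset" if the variable is not defined.
grep_opts_state=${GREP_OPTIONS-unset}
echo "GREP_OPTIONS is: $grep_opts_state"

# Run grep in a subshell with the variable removed, so the result cannot be
# influenced by it. Comparing this with a normal run shows whether the
# environment is interfering.
clean_count=$( (unset GREP_OPTIONS; printf '%s\n' abc xyz | grep -c abc) )
echo "matches with a clean environment: $clean_count"
```

It is also worth running `type egrep` to see whether egrep is an alias or a wrapper script that adds options of its own.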

Harold Bamford