The following is a simple bash command line:

grep -li 'regex' "filename with spaces" "filename"

No problems. Also the following works just fine:

grep -li 'regex' $(<listOfFiles.txt)

where listOfFiles.txt contains a list of filenames to be grepped, one filename per line.

The problem occurs when listOfFiles.txt contains filenames with embedded spaces. In all cases I've tried (see below) bash splits the filenames at the spaces so, for example, a line in listOfFiles.txt containing a name like ./this is a file.xml ends up trying to run grep on each piece (./this, is, a and file.xml).

I thought I was a relatively advanced bash user, but I cannot find a simple magic incantation to get this to work. Here are the things I've tried.

grep -li 'regex' `cat listOfFiles.txt`

Fails as described above (I didn't really expect this to work), so I thought I'd put quotes around each filename:

grep -li 'regex' `sed -e 's/.*/"&"/' listOfFiles.txt`

Bash treats the quotes as literal characters in the filename, because quotes produced by an expansion are not reparsed as shell syntax. It gives "No such file or directory" for each file, and still splits the filenames at the blanks.

for i in $(<listOfFiles.txt); do grep -li 'regex' "$i"; done

This fails the same way as the original attempt: the unquoted $(<listOfFiles.txt) is split into words before the loop runs, so the quotes around "$i" come too late to help. It is also very slow, since it has to launch one grep process per file instead of processing all files in one invocation.

The following works, but requires some careful double-escaping if the regex contains shell metacharacters:

eval grep -li 'regex' `sed -e 's/.*/"&"/' listOfFiles.txt`

Is this the only way to construct the command line so it will correctly handle filenames with spaces?

+10  A: 

Try this:

tmpIFS="$IFS"; IFS=$'\n'; grep -li 'regex' $(<listOfFiles.txt); IFS="$tmpIFS"

IFS is the Internal Field Separator. Bash uses its characters to split the results of unquoted expansions into words, which is where the filenames get broken apart. By default it contains space, tab and newline. By restricting it to the newline for this single command (note that the old value of IFS is saved in a temporary variable and restored afterwards), spaces are no longer treated as delimiters.

As Jefromi correctly points out below, the temporary variable can be avoided by placing the command inside parentheses, like so:

(IFS=$'\n'; grep -li 'regex' $(<listOfFiles.txt))
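
For anyone who wants to try this without touching real files, here is a minimal sketch. The temporary directory, the file names and the pattern 'needle' are all made up for the demo. It writes the newline as $'\n' (ANSI-C quoting) so that IFS holds a real newline rather than a backslash and an n, and it adds set -f, which is not part of the answer above but keeps names containing * or ? from being glob-expanded. The command substitution is itself a subshell, so neither setting leaks into the calling shell.

```shell
#!/usr/bin/env bash
# Sketch only: directory, file names and the pattern 'needle' are hypothetical.
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
cd "$demo_dir" || exit 1

printf 'needle\n'       > 'this is a file.xml'
printf 'nothing here\n' > 'plain.xml'
printf '%s\n' './this is a file.xml' './plain.xml' > listOfFiles.txt

# Split the list on newlines only. set -f disables globbing for the expansion.
# Both settings live and die inside the $(...) subshell.
matches=$(IFS=$'\n'; set -f; grep -li 'needle' $(<listOfFiles.txt))

printf '%s\n' "$matches"
```

Only the file with the embedded spaces contains the pattern, so that name should come back in one piece.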
Stephan202
You need to save `$IFS` and restore it.
Dennis Williamson
@Dennis: you're right. Fixed.
Stephan202
Even if you don't export it, the value will persist if this wasn't run in a subshell. Just try running `FOO=bar; echo $FOO` on one line, then `echo $FOO` on another. Subshells are automatically started for commands in pipelines, but `IFS=$'\n'` is of course not part of a pipeline here. The best solution is to surround the whole statement with parentheses, which manually tell bash to run the command in a subshell.
Jefromi
I still prefer running this in a subshell, but if you do want to save/restore IFS, you definitely want to quote the variable expansions.
Jefromi
@Jefromi: do you agree with the addendum as it is?
Stephan202
Yup! Much shorter and less error-prone.
Jefromi
Thanks guys. Time for me to go back and reread the bash man page from top to bottom again :-)
Jim Garrison
Another way to have a variable assignment only apply temporarily is to use a space between the variable assignment and the command it applies to: `IFS=$'\n' grep -li 'regex' $(<listOfFiles.txt)`
Dennis Williamson
@Dennis: I thought that as well, but for some reason that does not work on my machine. Perhaps that the scope of `IFS=$'\n'` is so limited in this case, that it doesn't even apply to `$(<listOfFiles.txt)`?
Stephan202
@Dennis: that doesn't work because the environment is only set after the processing of the argument list - so 'grep' sees the correct value of IFS, but the shell that is processing the argument list does not.
Jonathan Leffler
@Jonathan: thanks for clarifying that :)
Stephan202
+4  A: 

This works:

while IFS= read -r file; do grep -li dtw "$file"; done < listOfFiles.txt
Dennis Williamson
+3  A: 
tr '\n' '\0' < listOfFiles.txt | xargs -0 grep -li 'regex'

The -0 option on xargs tells xargs to use a null character rather than white space as a filename terminator. The tr command converts the incoming newlines to a null character.

This meets the OP's requirement that grep not be invoked multiple times. It has been my experience that for a large number of files avoiding the multiple invocations of grep improves performance considerably.

This scheme also avoids a limitation of the OP's original method, which breaks when listOfFiles.txt names so many files that the expanded command line exceeds the system's maximum argument length. xargs knows about that limit and will split the list across several grep invocations when needed.
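
A small sketch of both points, using made-up file names and the placeholder pattern 'needle'. The second pipeline uses -n 2 only to force tiny batches, which imitates what xargs does by itself when the list is too long for one command line. The sh -c helper that prints its argument count is purely illustrative.

```shell
#!/usr/bin/env bash
# Sketch only: file names and the pattern 'needle' are hypothetical.
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
cd "$demo_dir" || exit 1

printf 'needle\n' > 'first file.txt'
printf 'needle\n' > 'second file.txt'
printf 'other\n'  > 'third file.txt'
printf '%s\n' 'first file.txt' 'second file.txt' 'third file.txt' > listOfFiles.txt

# NUL-delimited names survive embedded spaces.
matches=$(tr '\n' '\0' < listOfFiles.txt | xargs -0 grep -li 'needle')
printf '%s\n' "$matches"

# Force batches of at most two names. Each output line is the number of
# arguments one invocation received.
batches=$(tr '\n' '\0' < listOfFiles.txt | xargs -0 -n 2 sh -c 'echo "$#"' _)
printf '%s\n' "$batches"
```

With three names and batches of two, the helper runs twice: once with two arguments and once with one.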

A related problem with combining xargs and grep is that grep prefixes each output line with the filename only when it is invoked with more than one file. Under xargs most invocations receive several files and so produce prefixed output, but the prefix disappears when listOfFiles.txt contains a single file, or when the last of several invocations receives a single filename. To achieve consistent output, add /dev/null to the grep command:

tr '\n' '\0' < listOfFiles.txt | xargs -0 grep -i 'regex' /dev/null

Note that this was not an issue for the OP, because he was using grep's -l option; however, it is likely to be an issue for others.
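
To see the difference, here is a minimal sketch with a list that holds a single, made-up filename. The pattern 'needle' is a placeholder as well.

```shell
#!/usr/bin/env bash
# Sketch only: the file name and the pattern 'needle' are hypothetical.
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
cd "$demo_dir" || exit 1

printf 'needle\n' > 'only file.txt'
printf '%s\n' 'only file.txt' > listOfFiles.txt

# One file argument: grep prints the matching line without a filename prefix.
without_null=$(tr '\n' '\0' < listOfFiles.txt | xargs -0 grep -i 'needle')

# /dev/null counts as a second file, so grep always adds the prefix.
with_null=$(tr '\n' '\0' < listOfFiles.txt | xargs -0 grep -i 'needle' /dev/null)

printf '%s\n' "$without_null" "$with_null"
```

The first result is the bare matching line. The second carries the filename and a colon in front of it.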

Michael Potter
A: 

Though it may overmatch (each space becomes the glob wildcard ?, which matches any single character), this is my favorite solution:

grep -i 'regex' $(cat listOfFiles.txt | sed -e "s/ /?/g")
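
A quick sketch of what overmatching looks like here, with made-up file names and the placeholder pattern 'needle'. Only 'a b.txt' is in the list, but the generated glob a?b.txt also picks up a-b.txt. The -l option is added only to make the result easy to count.

```shell
#!/usr/bin/env bash
# Sketch only: file names and the pattern 'needle' are hypothetical.
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
cd "$demo_dir" || exit 1

printf 'needle\n' > 'a b.txt'
printf 'needle\n' > 'a-b.txt'              # deliberately NOT in the list
printf '%s\n' 'a b.txt' > listOfFiles.txt

# The space becomes ?, and the shell expands a?b.txt against the directory.
matches=$(grep -li 'needle' $(sed -e 's/ /?/g' listOfFiles.txt))
match_count=$(printf '%s\n' "$matches" | grep -c .)

printf '%s\n' "$matches"
```

Both files are reported even though the list names only one of them.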

Chris Thiessen