You asked: "The first passes the file NAMES to Perl and the second passes the file CONTENTS it seems. Is this always true under Unix or a special property of Perl?" This behavior is not specific to Perl. Part of it is done by Unix, and part of it is a widely followed convention. The pipeline behavior (commands joined by |) is handled by the shell and the OS. What a program does with its command-line arguments, its input, and the output it produces is specific to each command.
Examples. Please follow through on your computer in Bash.
$ mkdir pipetestdir; cd pipetestdir
$ for f in {a..z}; do printf "%s\n" "File: $f, line: "{1..1000} > $f.txt; done
That creates an empty directory, changes into it, and creates 26 files (a.txt through z.txt) of 1000 lines each.
With the standard Unix utility cat, running cat *.txt shows the contents of the files. The *.txt is expanded by Bash to all 26 .txt files before cat ever runs. With wc -l *.txt you can verify the line count of each of the 26 files. You can also use a form such as wc -l {a..e}.txt, where Bash uses brace expansion. You can turn those forms around into a pipe and use cat *.txt | wc -l to get a single line count for all 26 files combined.
In the first example, wc -l *.txt, wc itself opens the 26 files, counts the lines in each, and displays the results. In the second example, cat *.txt | wc -l, the program cat opens the 26 files and writes a concatenated text stream to STDOUT; the | connects that stream through a pipe to the next program, in this case wc -l, which receives it on its STDIN and counts the lines without any regard to the separate files.
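A minimal sketch of that difference, using a scratch directory with two small files so it runs quickly. The names one.txt and two.txt are made up for illustration and are not part of the 26 files above:

```shell
# Scratch directory so the demo does not touch the 26 files above.
demo=$(mktemp -d)
printf '%s\n' 1 2 3 > "$demo/one.txt"   # 3 lines
printf '%s\n' 1 2 > "$demo/two.txt"     # 2 lines

# File names as arguments: wc opens each file itself, so it can
# report one count per file plus a total line.
by_name=$(cd "$demo" && wc -l one.txt two.txt)

# Contents on STDIN: wc sees one anonymous stream and prints one number.
by_stream=$(cd "$demo" && cat one.txt two.txt | wc -l)

echo "$by_name"
echo "$by_stream"
rm -rf "$demo"
```

The first echo should show three lines (one per file and a total), the second a single count of 5. The exact column spacing varies between wc implementations.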
With Perl one liners, you can easily search these files. Example:
$ perl -lne 'print if /^.*666/' *.txt # the devil's line from 26 files...
You could use egrep or awk to do the same:
$ egrep '^.*666$' *.txt
$ awk "/^.*666$/ {print}" *.txt
If you turn that form into a pipe, you are operating on the OUTPUT of the command to the left of Perl (or awk or egrep). That command's STDOUT is fed to Perl's STDIN. If that command produces file names, you are operating on file names:
$ ls *.txt | perl -lne 'print if /c|d|z/'
$ find . -name '*.txt' | perl -lne 'print if /c|d|z/'
That is, unless you expand the files into their contents first with cat:
$ cat *.txt | perl -lne 'print if /^.*?(c|d|z).*?666$/'
That produces output similar to this:
$ perl -lne 'print if /^.*?(c|d|z).*?666$/' *.txt
Perhaps this is where you got confused about the forms being interchangeable? They are not! Two very different things are going on. If you use cat *.txt | perl '...', all the files are conCATenated into one long text stream and sent to the next stage in the pipeline, in this case perl '...'. Perl cannot distinguish which text came from which file. It is only because we put a marker in every line when we created the files that we can see which file is which.
In the other form, perl '...' *.txt, perl opens the files itself and has full control over each text stream and file. You can control whether a file is opened at all, whether the file name is printed, and so on.
However, steer clear of the specific form cat a.txt | perl '...' (i.e., using cat on a single file), or you risk the dreaded Useless Use of Cat Award :-}
You asked specifically about the form:
$ perl -nle '... # same yada yada' `find . -type f`
As brian d foy pointed out, there are limits on command-line length, so you should be wary of this form. File names can also break in unexpected ways with backticks. Rather than the backtick form, use find with xargs:
$ find . -type f -print0 | xargs -0 perl -nle 'print if /^.*666$/'
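To get a feel for the length limit and for how xargs works around it, here is a small sketch. getconf ARG_MAX reports the system limit, and the -n 3 option is used only to force small batches for illustration; normally xargs picks the batch size itself:

```shell
# System limit, in bytes, on the combined size of arguments and environment.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX: $arg_max"

# xargs runs the command once per batch; -n 3 caps each batch at 3 arguments,
# so 10 inputs produce 4 echo invocations (3 + 3 + 3 + 1), one output line each.
batches=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | xargs -n 3 echo)
echo "$batches"
```

One consequence is that, for a very large file list, xargs may start perl more than once.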
And to see the issue with breaking filenames, type these commands:
$ mv z.txt "file name with spaces"
$ perl -ple '' `find . -name "file*"` #fails...
$ find . -name "file*" -print0 | xargs -0 perl -ple '' #works...
$ find . -type f -exec perl -wnl -e '/\s1$/ and print' {} + #alternative
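The reason the backtick form fails can be sketched by counting the arguments each form actually delivers. The count_args helper is hypothetical and exists only for this demo, and the sketch assumes the temporary directory path itself contains no whitespace:

```shell
demo=$(mktemp -d)
touch "$demo/file name with spaces"

# Hypothetical helper for illustration: report how many arguments it received.
count_args() { echo "$#"; }

# Unquoted command substitution (same as backticks): the shell splits find's
# output on whitespace, so one file name becomes four separate arguments.
split_count=$(count_args $(find "$demo" -name 'file*'))

# NUL-delimited names survive intact: xargs -0 passes exactly one argument.
safe_count=$(find "$demo" -name 'file*' -print0 | xargs -0 sh -c 'echo "$#"' sh)

echo "split: $split_count, safe: $safe_count"
rm -rf "$demo"
```

In the split case perl would be asked to open four names that do not exist as files, which is why the backtick command above fails.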