While learning Perl I am also learning Linux (Ubuntu), so it is kinda fire-hose sipping time around here.

What is the difference between:

find . -type f | perl -nle '... #aka yada yada'

and

perl -nle '... # same yada yada' `find . -type f`

The first passes the file NAMES to Perl and the second passes the file CONTENTS it seems. Is this always true under Unix or a special property of Perl?

+3  A: 

The first sends the file names, one per line, to perl's STDIN. Because there are no commandline arguments, -n makes perl loop over STDIN.

The second calls perl with a list of file names as arguments. When arguments are passed, -n opens each one as a file and reads it line by line.

So the first operates on the names of the files, and the second operates on the contents of the files.

You can see the code perl is writing for you by using B::Deparse:

perl -MO=Deparse -nle 'print'

produces

BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
    chomp $_;
    print $_;
}
-e syntax OK

The BEGIN block and the chomp are created by the -l option and the while loop is created by the -n option. ARGV is a special filehandle that performs the magic of reading from STDIN if no arguments are present or opening each of the arguments in turn if there are.

The two forms are definitely not interchangeable. One affects STDIN and the other affects the commandline arguments. If you change the first one to be find . -type f | xargs perl -nle '... #aka yada yada' then they would be mostly interchangeable (the xargs version might run perl more than once and the backtick version might just blow up because the commandline was too long).
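
The "might run perl more than once" caveat is easy to see with a small sketch. Here `-n 2` forces tiny batches purely for illustration, and `echo` stands in for perl; in real use xargs only splits the list when it would otherwise be too long for one command line.

```shell
# echo runs once per batch, so each output line is one separate invocation.
batches=$(printf '%s\n' a b c d e | xargs -n 2 echo)
printf '%s\n' "$batches"

# Count the invocations; $(( )) also strips any padding wc may add.
batch_count=$(( $(printf '%s\n' "$batches" | wc -l) ))
echo "invocations: $batch_count"
```

This matters for any one-liner that keeps state across files (a running total, an END block): with xargs that state is per invocation, not per pipeline.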

Many UNIX programs act as filters. A rule for filters is that they read from STDIN if handed no files on the commandline, or from a list of files given on the commandline. A short list includes cat, grep, and sort. Perl 5 makes implementing a filter easy, as you have seen. But be warned, the way Perl 5 implements this is not very safe. It uses the outdated two argument version of open, which means that certain filenames can have unintended consequences:

perl -nle print "cat /etc/passwd|"

That command actually runs cat /etc/passwd instead of opening the file named cat /etc/passwd|. To prevent this behavior, it is advisable to examine @ARGV for suspicious names or use the ARGV::readonly module to clean @ARGV for you:

perl -MARGV::readonly -nle print "echo foo|"
Can't open < echo foo|: No such file or directory.
Chas. Owens
@Chas. Owens: Thanks! I did observe what you have stated, but my question is: Is that a property of Perl or Unix? I thought that the forms were interchangeable. Is it because in the first case the stream is coming in on STDIN and in the second case the files end up in ARGV? Is it the presence or lack of command line arguments that causes the files to be opened or not?
carrot-top
That's a perl feature, implemented by its special `ARGV` handle.
rafl
@Chas. Owens: The `-MO=Deparse` is helping me understand this a LOT. Thanks...
carrot-top
+6  A: 

The first one generates the list of files and "pipes" it to perl. perl then reads the list by reading from standard input:

 while( <> ) { ... }

This is a common thing to do in unix shells, so you don't have to use perl at all:

 $ ifconfig | grep en0

The second one generates the list of file names and turns them into command-line arguments, which then show up in your program in @ARGV:

 foreach( @ARGV ) { ... }

This feature is not particular to Perl either. The shell provides the bits after the command in some sort of data structure that the program can access. Other languages have similar constructs even if they don't look the same.

However, the diamond operator, <>, will automatically go through the filenames you specify on the command line and read their lines, so that while loop still works. This is a feature particular to Perl.

The problem with the second approach tends to show up when you have a long list of arguments. The system limits how much can be passed on a single command line, and a long enough file list will exceed that. I don't like the second version as much just for that reason.

However, instead of using find(1) (the shell version), you can turn it into a self-contained Perl program:

$ find2perl . -type f

The output is a Perl program that doesn't have to rely on any external commands.

brian d foy
+1  A: 

You asked "The first passes the file NAMES to Perl and the second passes the file CONTENTS it seems. Is this always true under Unix or a special property of Perl?" This behavior is not specific to Perl. Part of it is being done by Unix. It is more of a widely followed convention. The pipeline behavior (commands followed by |) is being done by the OS. What a program does with its command-line input or the output it produces is command specific.
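
The convention in question (read the files named on the command line, or STDIN when none are given) can be sketched as a tiny shell filter. The `upcase` function and the scratch file are hypothetical names for illustration; the function simply leans on `cat`, which already follows the convention:

```shell
# upcase: hypothetical filter that upper-cases whatever it reads.
upcase() {
    # With no operands cat copies STDIN; with operands it reads each file in turn.
    cat -- "$@" | tr '[:lower:]' '[:upper:]'
}

scratch_file=$(mktemp)
printf 'from a file\n' > "$scratch_file"

from_stdin=$(printf 'from a pipe\n' | upcase)   # no arguments: reads STDIN
from_file=$(upcase "$scratch_file")             # argument given: reads the file
printf '%s\n' "$from_stdin" "$from_file"
rm -f "$scratch_file"
```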

Examples. Please follow through on your computer in Bash.

$ mkdir pipetestdir; cd pipetestdir    
$ for f in {a..z}; do printf "%s\n" "File: $f, line: "{1..1000} > $f.txt; done

That will create an empty directory, cd into it, and create 26 files of 1000 lines each in your empty directory.

With the Ubuntu / Linux utility cat, running cat *.txt shows the contents of the files; Bash expands *.txt to all 26 .txt files. With wc -l *.txt you can verify the line count of all 26 files. You can also write wc -l {a..e}.txt, where Bash uses brace expansion. You can turn those forms into a pipe and use cat *.txt | wc -l to get a single line count for all 26 files.

In the first example, wc -l *.txt opens the 26 files, counts the lines, and displays the results. In the second example, cat *.txt | wc -l, the program cat opens the 26 files and writes one concatenated text stream to STDOUT; the | turns that into a pipe directed at the next program, in this case wc -l, which receives the stream on its STDIN and counts its lines without any regard to the separate files.
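
The same comparison on a smaller scale, as a sketch (three short files in a scratch directory; the names are arbitrary):

```shell
demo_dir=$(mktemp -d)
for name in a b c; do
    printf '%s line 1\n%s line 2\n' "$name" "$name" > "$demo_dir/$name.txt"
done

# wc opens the files itself, so it can report a count per file plus a total.
per_file=$(cd "$demo_dir" && wc -l *.txt)

# cat merges everything into one stream; wc sees only STDIN and prints one number.
combined=$(( $(cd "$demo_dir" && cat *.txt | wc -l) ))

printf '%s\n' "$per_file"
echo "combined: $combined"
rm -rf "$demo_dir"
```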

With Perl one liners, you can easily search these files. Example:

$ perl -lne 'print if /^.*666/' *.txt    # the devil's line from 26 files...

You could use egrep or awk to do the same:

$ egrep '^.*666$' *.txt
$ awk "/^.*666$/ {print}" *.txt

If you turn that form into a pipe, you are operating on the OUTPUT of the command to the left of Perl (or awk or egrep). The previous command's STDOUT is being fed to Perl's STDIN. If that command produces file names, you are operating on file names:

$ ls *.txt | perl -lne 'print if /c|d|z/'
$ find . -name '*.txt' | perl -lne 'print if /c|d|z/'

Unless you expanded them first with cat:

$ cat *.txt | perl -lne 'print if /^.*?(c|d|z).*?666$/'

Which produces output similar to this:

$ perl -lne 'print if /^.*?(c|d|z).*?666$/' *.txt

Perhaps this is where you got confused about the forms being interchangeable? They are not! Two very different things are going on. If you use cat *.txt | perl '...' all the files are being conCATenated into one long text stream and sent to the next stage in the pipeline; in this case perl '...'. Perl would not be able to distinguish which text came from which file. It is only because we put a mark in each file when we created them that we can see which file is which.

In the other form, perl '...' *.txt, perl is opening the files and has full control over each text stream and file. You can control if you open the file or not, print the file name or not, etc...

Avoid, however, the specific form of cat a.txt | perl '...' (i.e., using cat on a single file) so you don't earn the dreaded Useless Use of Cat Award :-}

You asked specifically about the form:

$ perl -nle '... # same yada yada' `find . -type f`

As brian d foy pointed out, there are limitations on the command line length and you should be wary of this form. You can also have file names break in unexpected ways with back ticks. Rather than the back tick form, use find with xargs:

$ find . -type f -print0 | xargs -0 perl -nle 'print if /^.*666$/'

And to see the issue with breaking filenames, type these commands:

$ mv z.txt "file name with spaces" 
$ perl -ple '' `find . -name "file*"`       #fails...
$ find . -name "file*" -print0 | xargs -0 perl -ple '' #works...
$ find . -type f -exec perl -wnl -e '/\s1$/ and print' {} + #alternative
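
Why the backtick form fails comes down to word splitting, which a short sketch can make visible without perl at all. The `count_args` helper is a made-up name; it just reports how many arguments it received:

```shell
# count_args: hypothetical helper that prints how many arguments it was given.
count_args() { echo "$#"; }

demo_dir=$(mktemp -d)
touch "$demo_dir/file name with spaces"

# Unquoted substitution: the shell splits find's output on whitespace.
split_count=$(cd "$demo_dir" && count_args $(find . -name 'file*'))

# NUL-delimited: xargs -0 passes the name through as a single argument.
intact_count=$(cd "$demo_dir" && find . -name 'file*' -print0 | xargs -0 sh -c 'echo "$#"' sh)

echo "substitution: $split_count arguments; xargs -0: $intact_count argument"
rm -rf "$demo_dir"
```

With the substitution form, perl would be asked to open four files that do not exist (`./file`, `name`, `with`, `spaces`) instead of the one that does.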
drewk