I am writing something that will allow users to search through a log of theirs. Currently I have this, where $in{'SEARCH'} is the string they are searching for.

open(COMMAND, "grep \"$in{'SEARCH'}\" /home/$palace/palace/logs/$logfile | tail -n $NumLines |");
$f = <COMMAND>;
if ($f) {
    print $Title;
    print "<div id=log>\n";
    do { print $f . "<br>"; } while ($f = <COMMAND>);
    print "</div>\n";
} else {
    print $Error;
}
close(COMMAND);

However, I noticed that they could easily make the script and the grep command error out by entering a double quote (") or a backslash. Therefore I added this piece of code:

$in{'SEARCH'} =~ s|\\|\\\\|g;
$in{'SEARCH'} =~ s|"|\Q\"\E|g;

open(COMMAND, "grep \"$in{'SEARCH'}\" /home/$palace/palace/logs/$logfile | tail -n $NumLines |");
$f = <COMMAND>;
if ($f) {
    print $Title;
    print "<div id=log>\n";
    do { print $f . "<br>"; } while ($f = <COMMAND>);
    print "</div>\n";
} else {
    print $Error;
}
close(COMMAND);

However, I am still having problems. The grep command does not like the \ in its search pattern, giving errors like:

grep "\\" /home/test/palace/logs/chat.log
grep: Trailing backslash

Should I even continue trying to use the grep command? If so, what is a good Perl function for escaping the characters that trip grep up, say turning " into \" and so on? Or, instead of messing around, should I just use Perl code to accomplish the same thing, even though I've read it wouldn't be as fast as grep?


Update: 7-5-2009 5:20 PM EST


Many people contributed code, especially those attempting to be faster than the system grep. As of right now, it's still the fastest. Here is how the benchmarks go:

Using system grep:

Top   of  file: 1 wallclock secs ( 0.00 usr 0.01 sys + 0.13 cusr 0.15 csys = 0.29 CPU)
Bottom of file: 1 wallclock secs ( 0.00 usr 0.00 sys + 0.21 cusr 0.18 csys = 0.39 CPU)

Using Hynek's example (push and shift):

Top   of  file: 4 wallclock secs ( 3.78 usr + 0.19 sys = 3.97 CPU)
Bottom of file: 4 wallclock secs ( 3.86 usr + 0.19 sys = 4.05 CPU)

Using my Perl example (using the reverse function):

Top   of  file: 6 wallclock secs ( 4.76 usr + 1.45 sys = 6.21 CPU)
Bottom of file: 5 wallclock secs ( 3.93 usr + 1.44 sys = 5.37 CPU)

Using File::ReadBackwards:

Top   of  file: 11 wallclock secs (11.20 usr + 0.11 sys = 11.31 CPU)
Bottom of file: 1 wallclock secs ( 0.59 usr + 0.01 sys = 0.60 CPU)

Using xcramps's example (built-in grep):

Top   of  file: 9 wallclock secs ( 8.04 usr + 0.47 sys = 8.51 CPU)
Bottom of file: 8 wallclock secs ( 8.16 usr + 0.43 sys = 8.59 CPU)
+3  A: 

Definitely just use Perl. Keeping all the work in a single process will speed things up, and invoking a subcommand with user-specified arguments is just begging for exploits. If you do decide to stick with the subcommand, quotemeta might be worth looking at.
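For reference, a minimal sketch of what quotemeta does (the sample string here is invented, not from the question):

```perl
use strict;
use warnings;

# quotemeta backslash-escapes every non-word character, which makes a
# user-supplied string safe to interpolate into a regex (it is the
# functional form of \Q...\E).
my $search = 'a.*b "quoted"';
my $quoted = quotemeta($search);
print "$quoted\n";    # a\.\*b\ \"quoted\"
```

As BHare points out in the comment below, this also escapes metacharacters the user meant as part of a regex, so it trades regex support for safety.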

Ry4an
Nice function, but the problem is that the search should also support regular expressions; with quotemeta, * becomes \*, which then doesn't match how it should at all.
BHare
+5  A: 

I'll agree with Ry4an that going with Perl is probably sensible.

If you want to fix up your code, you should use single quotes, not double quotes, around the arguments that you don't want the shell messing with. For instance, both backticks and $(...) notation execute commands inside double quotes, and ${variable} and the like are expanded. Dealing with all those notations is too much like hard work; using single quotes is much simpler. Note that an embedded single quote needs to be replaced by the four-character sequence '\'' (that was fun!). The first quote closes the current quoted string; the backslash plus single quote gives a literal single quote; the last single quote starts a new single-quoted string. You don't need to escape any other characters.


Taking your code:

my $search = $in{SEARCH};
$search =~ s/'/'\\''/g;

open(COMMAND, "grep '$search' /home/$palace/palace/logs/$logfile | tail -n $NumLines |");
$f = <COMMAND>;
if  ($f)
{
    print $Title;
    print "<div id=log>\n";
    do { print $f."<br>";} while ($f = <COMMAND>);
    print "</div>\n";
}
else
{
    print $Error;
}
close(COMMAND);
Jonathan Leffler
This is the solution I ended up using. Single quotes were way easier than escaping the right characters. If you are looking for a solution that is strictly Perl, then check out Hynek -Pichi- Vychodil's post; he had the fastest Perl code, but Perl's default regex engine is slower than grep's. This is explained in blackkettle's comment with the link: swtch.com/~rsc/regexp/regexp1.html
BHare
+1  A: 

I did a quick Perl write-up of how it would work. I'm not sure if this is optimized or the best way, but here it is:

my $num_matches = 0;
my $logpath = "/home/$palace/palace/logs/$logfile";
open my $fh, "<", $logpath or die "Can't open [$logpath]: $!";

print $Title;
print "<div id=log>\n";

my @lines = reverse <$fh>;
foreach my $line (@lines) {
    #print "<b>$line</b> - $num_matches<br>";
    if ($line =~ m/\Q$in{'SEARCH'}\E/) {
        print $line . "<br>";
        $num_matches++;
    }
    if ($num_matches >= $NumLines) {
        last;
    }
}

print "</div>\n";

I do want to note that I ran benchmarks (http://perldoc.perl.org/Benchmark.html) on the Perl way (above) vs. my grep way (original question). These are the results:

Perl way: the code took 4 wallclock secs ( 2.89 usr + 0.96 sys = 3.85 CPU)

Grep way: the code took 1 wallclock secs ( 0.02 usr 0.04 sys + 0.31 cusr 0.14 csys = 0.51 CPU)

So I need to optimize my Perl code. Any ideas? I thought that by reading the file in reverse, it could find the matches ASAP and wouldn't need to read any further, saving it from reading the whole file into memory. However, I am guessing that the reverse function reads the whole file into memory anyway. Perhaps I should use system("tac $file") and read that, but again I am trying to avoid external command calls.
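Since I'd like to stay core-only, here is a rough sketch of reading a file from the end in fixed-size chunks, without slurping it all into memory (read_backwards and the 8192-byte block size are my own invented name and choice, not tested against the real log):

```perl
use strict;
use warnings;

# Core-only sketch: walk a file from the end in fixed-size chunks and
# hand complete lines to a callback in reverse order, without ever
# holding the whole file in memory. read_backwards is a made-up name.
sub read_backwards {
    my ($path, $callback) = @_;
    open my $fh, '<', $path or die "open $path: $!";
    my $pos   = -s $fh;    # start at end of file
    my $chunk = 8192;      # arbitrary block size
    my $tail  = '';        # partial line carried between chunks
    while ($pos > 0) {
        my $len = $pos >= $chunk ? $chunk : $pos;
        $pos -= $len;
        seek $fh, $pos, 0 or die "seek: $!";
        read $fh, my $buf, $len;
        # split keeping the newlines; the first piece may be incomplete,
        # since its beginning lies earlier in the file
        my @lines = split /(?<=\n)/, $buf . $tail;
        $tail = shift @lines;
        $callback->($_) for reverse @lines;
    }
    $callback->($tail) if length $tail;    # first line of the file
    close $fh;
}
```

With something like this, the match loop could stop as soon as $NumLines matches have been seen, instead of reversing the whole file first.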

BHare
Don't read the entire file into an array to read backwards, use File::ReadBackwards instead.
ysth
I require only libraries that are included by default.
BHare
Update: I have come to the conclusion that not expanding my code to allow some CPAN modules is silly. The reason I originally decided not to is that I am writing a panel in webmin, and I would like it to install like a normal webmin module, which to my understanding means the end user would have to install the required CPAN modules manually. Instead I have decided to create a directory called cpan, download the lone .pm file there, and reference it from my code.
BHare
Even with File::ReadBackwards, it takes 16 seconds for Perl to read a match that's deep in the file, whereas the grep | tail combo takes 0.1 seconds. Grrrrr
BHare
It's possible that the discrepancy in time has something to do with the regex engine you are using. The grep regex engine employs the Thompson algorithm for regex compilation and search, which is roughly linear in the input. However, the Perl (and Ruby, PHP, Python, etc.) implementations employ an approach that (for different reasons) requires exponential time on certain inputs: http://swtch.com/~rsc/regexp/regexp1.html
blackkettle
Look at my code (http://stackoverflow.com/questions/1083453/grep-vs-perl-should-i-mod-this-to-work-or-should-i-just-replace-it-with-perl-co/1083721#1083721) and try comparing it with your grep approach.
Hynek -Pichi- Vychodil
Try changing your foreach statement to a while, and remove the reverse call. What your code does is basically a) load the entire file into memory, and b) reverse it. You should use `while` instead.
Igor
Which version of Perl?
Brad Gilbert
Where are these defined ($palace,$logfile,$Title,%in)?
Brad Gilbert
`grep` doesn't read the whole file first, so why is your perl code doing that?
Brad Gilbert
$palace is basically their home dir's name. $logfile is the file being read, in this case a huge log file. $Title is just some HTML that prints an <h1> thing, and %in has variables from the POST, like the search string itself, etc.
BHare
+2  A: 

You can bypass shell-quoting issues completely by specifying the arguments separately:

open my $grepfh, "-|", "grep", "-F", $in{'SEARCH'}, "/home/$palace/palace/logs/$logfile" or die "grep open error: $! $?\n";

Including the tail, without any error checking:

open( my $resultfh, '-|' )
or do {
    # child of the '-|' open: build a grep | tail pipeline whose final
    # output flows back to the parent through $resultfh
    pipe(my $rdfh, my $wrfh);
    if (fork) {
        # run grep with its STDOUT aimed into the pipe
        open( STDOUT, '>&', $wrfh );
        exec( "grep", @grepargs );
    }
    else {
        # run tail reading grep's output; its STDOUT still goes to the parent
        open( STDIN, '<&', $rdfh );
        exec( "tail", @tailargs );
    }
};
while (my $line = readline($resultfh)) {
    ....
}
ysth
This did not work for me. a) there is no tail command, b) you're using -F, which is fixed-string grep; I need regular expressions, c) it never specifies what file to grep.
BHare
a) yes, you'd have to do the tail separately; b) I used -F based on your answer, where you matched with \Q...\E; if you don't want -F, you don't want those in your Perl match either; c) oops, added.
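(For anyone comparing the two: a quick sketch of why \Q...\E in a Perl match corresponds to grep -F; the sample strings are invented.)

```perl
use strict;
use warnings;

# \Q...\E disables regex metacharacters, so the match behaves like a
# fixed-string search, i.e. grep -F.
my $user = 'a.*b';    # looks like a regex pattern

# the literal occurrence of "a.*b" is found:
print "literal match\n" if 'xa.*by' =~ /\Q$user\E/;

# a string the *regex* a.*b would match is not found literally:
print "no regex match\n" unless 'aXXb' =~ /\Q$user\E/;
```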
ysth
+2  A: 

Try this approach:

my $re = qr/\Q$in{'SEARCH'}\E/;
my @lines;
while (<$fh>) {
    next unless m/$re/;
    push @lines, $_;
    shift @lines if @lines > $NumLines;
}

print @lines;
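A small usage sketch of the sliding-window idea, with made-up input, showing that only the last $NumLines matching lines are ever held in memory:

```perl
use strict;
use warnings;

# Made-up input demonstrating the push/shift window: however long the
# input is, at most $NumLines matching lines are retained.
my $NumLines = 2;
my @lines;
for my $line ("x1\n", "hit a\n", "x2\n", "hit b\n", "hit c\n") {
    next unless $line =~ /hit/;
    push @lines, $line;
    shift @lines if @lines > $NumLines;
}
print @lines;    # only the two most recent matches: hit b, hit c
```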
Hynek -Pichi- Vychodil
This is what I was talking about above. while (<$fh>) reads from $fh one line at a time, instead of using reverse, which loads the entire file into memory.
Igor
+1  A: 

In terms of performance, Perl is excellent. It may not be as fast as the grep process, but I think the difference may not be noticeable to the end user.

Since one of your goals is control over the user input, I suggest you go with Perl, because it can give you as much control over the user input as you may need, now or in the future.

jpbaudoin
A: 

Perl has a built-in grep:

open(IN, $file) || die "Can't open $file: $!";
my @result = grep { /\Q$in{'SEARCH'}\E/ } <IN>;
close(IN);

# this is the equivalent of your tail
my $end   = $#result;
my $begin = $end - $NumLines + 1;
$begin = 0 if $begin < 0;
print "@result[$begin..$end]";
xcramps
This is not bad at all, but the Linux grep command still blows it away in terms of performance.
BHare
and the "shift" solution, how did it measure?
Massa
I just checked the benchmarks, and I was actually a bit wrong on this one; it was one of the slowest. Please see the update in my question for benchmark times. The shift solution was the fastest besides system grep.
BHare
A: 

Silly idea, but how about using grep with single quotes when there is no single quote in the search string, and Perl when there is? That way you get fast performance and security at the same time.
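In case it helps, here is a rough sketch of that hybrid (search_log, and the pure-Perl fallback shown, are my own illustrative inventions; it assumes a Unix grep and tail on the PATH, and a log path without single quotes):

```perl
use strict;
use warnings;

# Hybrid sketch: single-quote the pattern for grep when that is safe,
# otherwise fall back to a pure-Perl scan. search_log is a made-up name.
sub search_log {
    my ($search, $logpath, $num_lines) = @_;
    my @hits;
    if ($search =~ /'/) {
        # a single quote would break '...' quoting, so scan in Perl,
        # keeping only the last $num_lines matches (push/shift window)
        open my $fh, '<', $logpath or die "open $logpath: $!";
        while (my $line = <$fh>) {
            next unless $line =~ /$search/;
            push @hits, $line;
            shift @hits if @hits > $num_lines;
        }
        close $fh;
    }
    else {
        # no single quotes: safe to wrap in '...' and let grep | tail work
        open my $cmd, '-|', "grep -- '$search' '$logpath' | tail -n $num_lines"
            or die "pipe: $!";
        @hits = <$cmd>;
        close $cmd;
    }
    return @hits;
}
```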

Arkadiy
The solution I have now uses system grep, and I have it escape the '; it's a good solution. Now this thread is more about finding out why Perl isn't as fast as grep, but I think we found that out based on the Thompson idea.
BHare