I have a bash script that cuts out a section of a logfile between 2 timestamps, but because of the size of the files, it takes quite a while to run.

If I were to rewrite the script in Perl, could I achieve a significant speed increase - or would I have to move to something like C to accomplish this?

#!/bin/bash

if [ $# -ne 3 ]; then
  echo "USAGE $0 <logfile(s)> <from date (epoch)> <to date (epoch)>"
  exit 1
fi

LOGFILES=$1
FROM=$2
TO=$3
rm -f /tmp/getlogs??????
TEMP=`mktemp /tmp/getlogsXXXXXX`

## LOGS NEED TO BE LISTED CHRONOLOGICALLY
ls -lnt $LOGFILES|awk '{print $8}' > $TEMP
LOGFILES=`tac $TEMP`
cp /dev/null $TEMP

findEntry() {
  RETURN=0
  dt=$1
  fil=$2
  ln1=$3
  ln2=$4
  t1=`tail -n+$ln1 $fil|head -n1|cut -c1-15`
  dt1=`date -d "$t1" +%s`
  t2=`tail -n+$ln2 $fil|head -n1|cut -c1-15`
  dt2=`date -d "$t2" +%s`
  if [ $dt -ge $dt2 ]; then
    mid=$dt2
  else
    mid=$(( (($ln2-$ln1)*($dt-$dt1)/($dt2-$dt1))+$ln1 ))
  fi
  t3=`tail -n+$mid $fil|head -n1|cut -c1-15`
  dt3=`date -d "$t3" +%s`
  # finished
  if [ $dt -eq $dt3 ]; then
    # FOUND IT (scroll back to the first match)
    while [ $dt -eq $dt3 ]; do
      mid=$(( $mid-1 ))
      t3=`tail -n+$mid $fil|head -n1|cut -c1-15`
      dt3=`date -d "$t3" +%s`
    done
    RETURN=$(( $mid+1 ))
    return
  fi
  if [ $(( $mid-1 )) -eq $ln1 ] || [ $(( $ln2-1)) -eq $mid ]; then
    # FOUND NEAR IT
    RETURN=$mid
    return
  fi
  # not finished yet
  if [ $dt -lt $dt3 ]; then
    # too high
    findEntry $dt $fil $ln1 $mid
  else
    if [ $dt -ge $dt3 ]; then
      # too low
      findEntry $dt $fil $mid $ln2
    fi
  fi
}

# Check timestamps on logfiles
LOGS=""
for LOG in $LOGFILES; do
  filetime=`ls -ln $LOG|awk '{print $6,$7}'`
  timestamp=`date -d "$filetime" +%s`
  if [ $timestamp -ge $FROM ]; then
    LOGS="$LOGS $LOG"
  fi
done

# Check first and last dates in LOGS to refine further
for LOG in $LOGS; do
    if [ ${LOG%.gz} != $LOG ]; then
      gunzip -c $LOG > $TEMP
    else
      cp $LOG $TEMP
    fi
    t=`head -n1 $TEMP|cut -c1-15`
    FIRST=`date -d "$t" +%s`
    t=`tail -n1 $TEMP|cut -c1-15`
    LAST=`date -d "$t" +%s`
    if [ $TO -lt $FIRST ] || [ $FROM -gt $LAST ]; then
      # This file is entirely out of range
      cp /dev/null $TEMP
    else
      if [ $FROM -le $FIRST ]; then
        if [ $TO -ge $LAST ]; then
          # Entire file is within range
          cat $TEMP
        else
          # Last part of file is out of range
          STARTLINENUMBER=1
          ENDLINENUMBER=`wc -l<$TEMP`
          findEntry $TO $TEMP $STARTLINENUMBER $ENDLINENUMBER
          head -n$RETURN $TEMP
        fi
      else
        if [ $TO -ge $LAST ]; then
          # First part of file is out of range
          STARTLINENUMBER=1
          ENDLINENUMBER=`wc -l<$TEMP`
          findEntry $FROM $TEMP $STARTLINENUMBER $ENDLINENUMBER
          tail -n+$RETURN $TEMP
        else
          # range is entirely within this logfile
          STARTLINENUMBER=1
          ENDLINENUMBER=`wc -l<$TEMP`
          findEntry $FROM $TEMP $STARTLINENUMBER $ENDLINENUMBER
          n1=$RETURN
          findEntry $TO $TEMP $STARTLINENUMBER $ENDLINENUMBER
          n2=$RETURN
          tail -n+$n1 $TEMP|head -n$(( $n2-$n1 ))
        fi
      fi
    fi
done
rm -f /tmp/getlogs??????


+17  A: 

You will almost certainly realize a massive speed benefit from writing your script in Perl just by cutting off the file read when you pass your second timestamp.

More generally, yes; a bash script of any complexity, unless it's a truly amazing piece of wizardry, can handily be outperformed by a Perl script for equivalent inputs and outputs.

chaos
Right on. In most cases the performance difference isn't down to the language implementation, but to the fact that a different language lets you apply better algorithms.
Javier
I have added my original bash script to the question, if it helps
Brent
+17  A: 

Perl is absurdly faster than Bash. And, for text manipulation, you can actually achieve better performance with Perl than with C, unless you take the time to write complex algorithms. Of course, for simple stuff C can be unbeatable.

That said, if your "bash" script is not looping, just calling other programs, then there isn't any gain to be had. For example, if your script looks like "cat X | grep Y | cut -f 3-5 | sort | uniq", then most of the time is spent on cat, grep, cut, sort and uniq, NOT on Bash.

You'll gain performance if there is any loop in the script, or if you save multiple reads of the same file.

You say you cut stuff between two timestamps on a file. Let's say your Bash script looks like this:

LINE1=`grep -n TIMESTAMP1 filename | head -1 | cut -d ':' -f 1`
LINE2=`grep -n TIMESTAMP2 filename | head -1 | cut -d ':' -f 1`
tail -n +$LINE1 filename | head -n $(($LINE2-$LINE1))

Then you'll gain performance, because you are reading the whole file three times: once for each command where "filename" appears. In Perl, you would do something like this:

my $state = 0;                  # becomes 1 once TIMESTAMP1 has been seen
while (<>) {
  exit if /TIMESTAMP2/;         # stop reading as soon as the end marker appears
  print $_ if $state == 1;      # only print lines between the two markers
  $state = 1 if /TIMESTAMP1/;
}

This will read the file only once and will also stop as soon as it reads TIMESTAMP2. Since you are processing multiple files, you would not call exit there; instead you would skip the rest of the current file (Perl's loop-exit keyword is last, not break) so that the script can continue with the remaining files.
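
For illustration, a rough multi-file variant of that loop (an untested sketch; TIMESTAMP1 and TIMESTAMP2 stand for whatever patterns you match on). Closing ARGV makes <> move on to the next file, and the continue block resets the state when each file ends:

#!/usr/bin/perl
use strict;
use warnings;

my $state = 0;                   # becomes 1 once TIMESTAMP1 has been seen
while (<>) {
    if (/TIMESTAMP2/) {
        close ARGV;              # done with this file, skip the rest of it
        next;
    }
    print if $state;
    $state = 1 if /TIMESTAMP1/;
} continue {
    $state = 0 if eof;           # each new file starts before TIMESTAMP1 again
}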

Anyway, seeing your script, I'm positive you'll gain a lot by rewriting it in Perl. Leaving aside the loops that deal with file names (whose speed WILL improve, but is probably insignificant), for each file that is not entirely inside or outside the range you:

  1. Read the WHOLE file just to count its lines!
  2. Do multiple tails on the file
  3. Finish by running "head" or "tail" on the file once again

Furthermore, every one of those tails is piped into a head, and each time you do that some piece of code has to read through the data again. Some of those lines end up being read 10 times or more!
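
To make that concrete, the whole wc -l / tail / head dance for one file can be a single pass (a sketch only; the file name and the 1-based line numbers $n1 and $n2 are hypothetical arguments):

#!/usr/bin/perl
use strict;
use warnings;

my ($file, $n1, $n2) = @ARGV;
open my $fh, '<', $file or die "Cannot open $file: $!";
while (<$fh>) {
    next if $. < $n1;        # $. is the current line number
    print;
    last if $. >= $n2;       # stop reading as soon as the range ends
}
close $fh;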

Daniel
By the way, that's an unnecessary use of cat.
Dennis Williamson
Personally, I do this 'unnecessary' use of cat all the time for style reasons - I like having the data always flow from left to right, not right to left then back to the right as in grep Y < X | foo > Z
bdonlan
Cat is very light-weight, so removing it is premature optimization. Using cat has the added advantage of making it easier to change the order of the following steps. Also, it can more easily be replaced with gzcat if the need arises. Finally, who cares? I wasn't suggesting it.
Daniel
http://blog.jrock.us/articles/Useless%20use%20of%20%22useless%20use%22.pod
jrockway
@Bdonlan: you can solve the left to right issue by writing `< X grep Y | foo > Z`. This especially helps if you think you might need to redo the grep command alone a few times (`< file grep item`), and you want to have your cursor at the point you're going to edit when you call the previous command back up with an arrow key. See here: http://sial.org/howto/shell/useless-cat/
Telemachus
+2  A: 

I would profile all three solutions and pick which is best in terms of initial startup speed, processing speed, and memory usage.

Something like Perl/Python/Ruby may not be the absolute fastest, but you can rapidly develop in those languages - much faster than in C and even Bash.
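
For rough wall-clock numbers, something along these lines would do (a sketch; the script names, log paths and epoch values are placeholders, and memory usage would still need a separate check, e.g. with /usr/bin/time):

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Placeholder commands for the competing implementations.
my @candidates = (
    q{./getlogs.sh '/var/log/app.log*' 1247672719 1252172093 > /dev/null},
    q{perl getlogs.pl /var/log/app.log* > /dev/null},
);

for my $cmd (@candidates) {
    my $t0 = [gettimeofday];
    system($cmd) == 0 or warn "'$cmd' exited non-zero\n";
    printf "%8.2f s  %s\n", tv_interval($t0), $cmd;
}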

Nick Presta
+1  A: 

It depends on how your bash script is written. If you are parsing the log file with bash's while-read loop rather than with awk, then switching to awk will improve the speed.

ghostdog74
+1  A: 

bash actually reads the script a line at a time as it interprets it on the fly (which you'll be made painfully aware of if you ever modify a bash script while it's still running), rather than preloading and parsing it all at once. So yes, Perl will generally be a lot faster if you're doing anything that you wouldn't normally do in bash anyway.

David
+6  A: 

Updated script based on Brent's comment: This one is untested.

#!/usr/bin/perl

use strict;
use warnings;

my %months = (
    jan => 1, feb => 2,  mar => 3,  apr => 4,
    may => 5, jun => 6,  jul => 7,  aug => 8,
    sep => 9, oct => 10, nov => 11, dec => 12,
);

while ( my $line = <> ) {
    my $ts = substr $line, 0, 15;                # e.g. "Jul 15 00:03:19"
    next if parse_date($ts) lt '0201100543';     # before the range: skip
    last if parse_date($ts) gt '0715123456';     # past the range: stop
    print $line;
}

# Turn "Mon DD HH:MM:SS" into a lexically sortable MMDDhhmmss string.
sub parse_date {
    my ($month, $day, $time) = split ' ', $_[0];
    my ($hour, $min, $sec) = split /:/, $time;
    return sprintf(
        '%2.2d%2.2d%2.2d%2.2d%2.2d',
        $months{lc $month}, $day,
        $hour, $min, $sec,
    );
}


__END__
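
If the range boundaries were to stay in epoch seconds, as in the FROM/TO arguments of the original bash script, the parsing could produce epoch values instead. A separate, self-contained sketch using the core Time::Local module (it assumes the current year, since the log format carries none; the sub name ts_to_epoch is made up for illustration):

use Time::Local qw(timelocal);

my %month_index = (
    jan => 0, feb => 1,  mar => 2,  apr => 3,
    may => 4, jun => 5,  jul => 6,  aug => 7,
    sep => 8, oct => 9,  nov => 10, dec => 11,
);

# Turn "Jul 15 00:03:19" into epoch seconds, assuming the current year.
sub ts_to_epoch {
    my ($month, $day, $time) = split ' ', $_[0];
    my ($hour, $min, $sec)   = split /:/, $time;
    my $year = (localtime)[5] + 1900;
    return timelocal($sec, $min, $hour, $day, $month_index{lc $month}, $year);
}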

Previous answer for reference: What is the format of the file? Here is a short script which assumes the first column is a timestamp and prints only lines that have timestamps in a certain range. It also assumes that the timestamps are sorted. On my system, it took about a second to filter 900,000 lines out of a million:

#!/usr/bin/perl

use strict;
use warnings;

while ( <> ) {
    my ($ts) = split;            # first whitespace-separated field is the timestamp
    next if $ts < 1247672719;    # before the range: skip
    last if $ts > 1252172093;    # past the range: stop reading
    print;                       # inside the range: print the whole line
}

__END__
Sinan Ünür
This is the idea - except that the logs use dates of the format: Jul 15 00:03:19
Brent
@Brent is there a year somewhere in that date?
Sinan Ünür
No, I guess it assumes the current year. In my case this might only be a problem on Jan 1 :)
Brent
By the way, the files are chronologically sorted, starting with the above formatted time, and rotated daily at close to midnight.
Brent
Thanks, I'll compare this to my bash script and let you know...
Brent
@Brad Gilbert Thank you for the edit but I am going to roll it back. The first argument to `split` is always a pattern except in the **special case** of `split ' '`.
Sinan Ünür
Thanks - I used the basic code above and the results are stunningly faster. From the other comments I realize that my head and tail commands - which I assumed could start reading at a specific place in the file - can't, and were the main source of the slowness.
Brent
@Brent thank you for accepting my answer despite my initial negative reaction.
Sinan Ünür
+1  A: 

I agree that moving from a bash-only script to Perl (or even awk, if a Perl environment is not readily available) could yield a speed benefit, assuming both versions are equally well written.

However, if the extract is amenable to being produced by a bash script that builds the parameters for, and then calls, grep with a regex, that could be faster than a 'pure' script.

mas
+5  A: 

Based on the shell code you have, with multiple calls to tail/head, I'd say absolutely Perl could be faster. C could be even faster, but the development time probably won't be worth it, so I'd stick to Perl. (I say "could" because you can write shell scripts in Perl, and I've seen enough of those to cringe. That obviously wouldn't have the speed benefit that you want.)

Perl has a higher startup cost, or so it's claimed. Honestly, I've never noticed. If your alternative is to do it in Java, Perl has no startup cost. Compared to Bash, I simply haven't noticed. What I have noticed is that as I get away from calling all the specialised Unix tools, which are great when you don't have alternatives, and move toward doing it all in a single process, speed goes up. The overhead of creating new processes on Unix isn't as severe as it may have been on Windows, but it's still not entirely negligible as you have to reinitialise the C runtime library (libc) each time, parse arguments, open files (perhaps), etc. In Perl, you end up using vast swaths of memory as you pass everything around in a list or something, but it is all in memory, so it's faster. And many of the tools you're used to are either built in (map/grep, regexes) or are available as modules on CPAN. A good combination of these would get the job done easily.

The big thing is to avoid re-reading files. It's costly. And you're doing it many times. Heck, you could use the :gzip I/O layer (from the PerlIO::gzip module) on open to read your gzipped files directly, saving yet another pass - and this would be faster in that you'd be reading less from disk.
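
For instance, reading a rotated .gz log directly might look like this (a sketch; it assumes the PerlIO::gzip module from CPAN is installed, and the file name is a placeholder):

use strict;
use warnings;
use PerlIO::gzip;                    # provides the :gzip I/O layer

my $logfile = 'app.log.1.gz';        # placeholder path
open my $fh, '<:gzip', $logfile or die "Cannot open $logfile: $!";
while (my $line = <$fh>) {
    # ... same timestamp filtering as for the plain-text logs ...
}
close $fh;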

Tanktalus
A: 

In your bash script, put this:

perl -ne "print if /$FROM/../$TO/" $LOGFILES

$FROM and $TO are really regexes matching your start and end times.

They are inclusive, so you might want to put 2009-06-14 23:59:59 for your end time, since 2009-06-15 00:00:00 will include transactions at midnight.
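
The same flip-flop idea as a standalone script, for reference (a sketch; the two patterns are hypothetical and would normally come from the command line):

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical boundary patterns for "Mon DD HH:MM:SS"-style timestamps.
my $from = qr/^Jun 14 00:00:00/;
my $to   = qr/^Jun 14 23:59:59/;

while (<>) {
    # The range operator turns on at the first match of $from and off
    # after the first subsequent match of $to, both lines included.
    print if /$from/ .. /$to/;
}

Note that if the end pattern never matches a line exactly, the range stays on until end of input, so a pattern that is guaranteed to occur (or an explicit timestamp comparison, as in the answers above) is safer.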

Mathieu Longtin
A: 

Well, bash is interpreted line by line as it runs, and it depends on calling a lot of external programs (depending on what you want to do). You often have to use temp files as intermediate storage for result sets. It (the shell) was originally designed to talk to the system and automate command sequences (shell files).

Perl is more like C: it's largely self-contained, with a huge library of free code, and it's compiled before it runs, so it runs much faster - e.g. about 80-90% of the speed of C - but it's easier to program (e.g. variable sizes are dynamic).

joe