I have a set of mail logs: mail.log mail.log.0 mail.log.1.gz mail.log.2.gz

Each of these files contains chronologically sorted lines that begin with timestamps like:

May 3 13:21:12 ...

How can I easily grab every log entry after a certain date/time and before another date/time using bash (and related command line tools) without comparing every single line? Keep in mind that my before and after dates may not exactly match any entries in the logfiles.

It seems to me that I need to determine the offset of the first line greater than the starting timestamp, and the offset of the last line less than the ending timestamp, and cut that section out somehow.

A: 

It may be possible in a Bash environment, but you should really take advantage of tools that have more built-in support for working with strings and dates. For instance, Ruby seems to have the built-in ability to parse your date format. It can then convert it to an easily comparable Unix Timestamp (a positive integer representing the seconds since the epoch).

irb> require 'time'
# => true

irb> Time.parse("May 3 13:21:12").to_i
# => 1241371272

You can then easily write a Ruby script:

  • Provide a start and end date, and convert those to Unix Timestamps.
  • Scan the log files line by line, converting each line's date into its Unix Timestamp, and check whether it falls between the start and end Timestamps.

Note: Converting to a Unix Timestamp integer first is nice because comparing integers is very easy and efficient to do.

You mentioned "without comparing every single line." It's going to be hard to "guess" where in the log file the entries start being too old or too new without checking all the values in between. However, if there is indeed a monotonically increasing trend, then you know immediately when to stop parsing lines, because as soon as the next entry is too new (or too old, depending on the layout of the data) you can stop searching. Still, there is the problem of finding the first line in your desired range.


I just noticed your edit. Here is what I would say:

If you are really worried about efficiently finding that start and end entry, then you could do a binary search for each. Or, if that seems like overkill or too difficult with bash tools, you could use a heuristic of reading only 5% of the lines (1 in every 20) to quickly get a close-to-exact answer, and then refine it if desired. These are just some suggestions for performance improvements.
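
For example, a minimal sketch of that 1-in-20 sampling idea, assuming GNU date and syslog-style "Mon D HH:MM:SS" timestamps; the file name, argument handling, and the final echo are placeholders:

#!/usr/bin/env bash
# Sketch only: sample every 20th line to find roughly where the range starts.
# Assumes GNU date (date -d) and a start time passed as the first argument.
START_TS=$(date -d "$1" +%s)

awk 'NR % 20 == 1 { print NR, $1, $2, $3 }' mail.log |
while read -r lineno mon day time ; do
    ts=$(date -d "$mon $day $time" +%s)
    if (( ts >= START_TS )) ; then
        # The real boundary is somewhere in the 20 lines before this one.
        echo "range starts at or shortly before line $lineno"
        break
    fi
done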

Joseph Pecoraro
I can do the same thing with the date command (and simpler), but it will take forever if each line needs to be examined. One idea is checking the first and last line of each logfile, and ignoring the files that fall entirely outside the range, for instance.
Brent
The man page for `date` says "date -- display or set date and time". I would be interested to see how you could read a file and convert "May 3 13:21:12" to a Unix Timestamp with the date command.
Joseph Pecoraro
date -d "May 3 13:32:38" +%s
Brent
I would think running a new date process every single time you come across a date string would be far slower than just using the language's built-in support. But I've never timed it.
Joseph Pecoraro
+1  A: 

You have to look at every single line in the range you want (to tell whether it's in the range you want), so I'm guessing you mean not every line in the file. At a bare minimum, you will have to look at every line in the file up to and including the first one outside your range (I'm assuming the lines are in date/time order).

This is a fairly simple pattern:

state = preprint
for every line in file:
    if line.date > enddate:
        exit for loop
    if line.date >= startdate:
        state = print
    if state == print:
        print line

You can write this in awk, Perl, Python, even COBOL if you must, but the logic is always the same.
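
As an illustration, here is a rough awk version of that pattern. It is only a sketch: the start/end keys and the month table are assumptions, and the year is ignored because syslog-style timestamps carry none.

#!/usr/bin/env bash
# Sketch: print log lines between two "MM DD HH:MM:SS" keys. Assumes
# syslog-style "Mon D HH:MM:SS" prefixes; the keys are zero-padded so
# plain string comparison orders them correctly.
START="05 06 00:00:00"
END="05 08 00:00:00"

awk -v start="$START" -v end="$END" '
BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
    for (i in m) mon[m[i]] = sprintf("%02d", i)
}
{
    key = mon[$1] " " sprintf("%02d", $2) " " $3   # e.g. "05 03 13:21:12"
    if (key > end) exit        # past the range: stop reading the file
    if (key >= start) print    # inside the range: print the whole line
}
' mail.log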

Locating the line numbers first (with, say, grep) and then just blindly printing out that line range won't help, since grep also has to look at all the lines (all of them, not just up to the first one outside the range, and most likely twice: once for the first line and once for the last).

If this is something you're going to do quite often, you may want to consider shifting the effort from 'every time you do it' to 'once, when the file is stabilized'. An example would be to load up the log file lines into a database, indexed by the date/time.

That takes a while to get set up but will result in your queries becoming a lot faster. I'm not necessarily advocating a database - you could probably achieve the same effect by splitting the log files into hourly logs thus:

2009/
  01/
    01/
      0000.log
      0100.log
      : :
      2300.log
    02/
    : :

Then for a given time, you know exactly where to start and stop looking. The range 2009/01/01-15:22 through 2009/01/05-09:07 would result in:

  • some (the last bit) of the file 2009/01/01/1500.log.
  • all of the files 2009/01/01/1[6-9]*.log.
  • all of the files 2009/01/01/2*.log.
  • all of the files 2009/01/0[2-4]/*.log.
  • all of the files 2009/01/05/0[0-8]*.log.
  • some (the first bit) of the file 2009/01/05/0900.log.

Of course, I'd write a script to return those lines rather than trying to do it manually each time.
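
A rough sketch of such a script, assuming GNU date, the YYYY/MM/DD/HH00.log layout above, and illustrative start/end times; the first and last files would still need line-level trimming:

#!/usr/bin/env bash
# Sketch: cat every hourly file that overlaps the requested range.
LOGROOT=logs
START="2009-01-01 15:00"
END="2009-01-05 09:00"

t=$(date -d "$START" +%s)
end=$(date -d "$END" +%s)
while (( t <= end )) ; do
    f="$LOGROOT/$(date -d "@$t" +%Y/%m/%d/%H00).log"
    [ -f "$f" ] && cat "$f"
    t=$(( t + 3600 ))
done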

paxdiablo
No, I'm pretty sure you don't need to look at every single line. What about examining the middle entry of the file first, and rejecting half the file right off the bat?
Brent
Actually, I'm pretty sure you do, since you wanted to use standard UNIX tools - I'm not aware of ANY that do what you describe (i.e., fseek to the middle and find the nearest line). You could write one yourself but, if you're doing that, all sorts of new possibilities open up.
paxdiablo
+4  A: 

Convert your min/max dates into "seconds since epoch",

MIN=`date --date="$1" +%s`
MAX=`date --date="$2" +%s`

Convert the first n words in each log line to the same,

L_DATE=`echo $LINE | awk '{print $1 $2 ... $n}'`
L_DATE=`date --date="$L_DATE" +%s`

Compare and throw away lines until you reach MIN,

if (( $MIN > $L_DATE )) ; then continue ; fi

Compare and print lines until you reach MAX,

if (( $L_DATE <= $MAX )) ; then echo $LINE ; fi

Exit when you exceed MAX.

if (( $L_DATE > $MAX )) ; then exit 0 ; fi

The whole script minmaxlog.sh looks like this,

#!/usr/bin/env bash

MIN=`date --date="$1" +%s`
MAX=`date --date="$2" +%s`

while true ; do
    read LINE
    if [ "$LINE" = "" ] ; then break ; fi

    L_DATE=`echo $LINE | awk '{print $1 " " $2 " " $3 " " $4}'`
    L_DATE=`date --date="$L_DATE" +%s`

    if (( $MIN > $L_DATE  )) ; then continue ; fi
    if (( $L_DATE <= $MAX )) ; then echo $LINE ; fi
    if (( $L_DATE >  $MAX )) ; then break ; fi
done

I ran it on this file minmaxlog.input,

May 5 12:23:45 2009 first line
May 6 12:23:45 2009 second line
May 7 12:23:45 2009 third line
May 9 12:23:45 2009 fourth line
June 1 12:23:45 2009 fifth line
June 3 12:23:45 2009 sixth line

like this,

./minmaxlog.sh "May 6" "May 8" < minmaxlog.input
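
If the same filter should cover all of the rotated logs from the question, oldest entries first, one possible invocation (a sketch; it assumes the gzipped files only need zcat and that the script reads stdin as above) is:

# The highest-numbered rotation holds the oldest entries.
( zcat mail.log.2.gz mail.log.1.gz ; cat mail.log.0 mail.log ) \
    | ./minmaxlog.sh "May 6" "May 8"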
Dylan
I really don't think that bash is the way to go for efficiency, though. This is extremely simple, yes, but if you want it to be fast, consider using C with lseek() and a binary search.
Dylan
Why care about performance if you are only going to do it once or twice?
Daniel
@Daniel, where was it said that this was only going to be done once or twice?
paxdiablo
+1  A: 

Here is one basic idea of how to do it:

  1. Examine the datestamp on the file to see if it is irrelevant.
  2. If it could be relevant, unzip it if necessary and examine the first and last lines of the file to see if it contains the start or finish time.
  3. If it does, use a recursive function to determine whether it contains the start time in the first or second half of the file (see the sketch after this list). Using a recursive function, I think you could find any date in a million-line logfile with around 20 comparisons.
  4. echo the logfile(s) in order from the offset of the first entry to the offset of the last entry (no more comparisons)
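
A very rough sketch of step 3 as a byte-offset binary search, assuming GNU date and stat, an uncompressed logfile, and syslog-style timestamps in the first three fields; the file name, argument handling, and the final echo are placeholders:

#!/usr/bin/env bash
# Sketch only: binary search over byte offsets for the first line whose
# timestamp is at or after the target time. After seeking into the middle
# of the file, the (probably partial) first line is discarded so every
# comparison is made on a whole line. Roughly log2(filesize) iterations.
file=$1
target=$(date -d "$2" +%s)
lo=0
hi=$(stat -c %s "$file")

while (( hi - lo > 1 )) ; do
    mid=$(( (lo + hi) / 2 ))
    line=$(tail -c +"$((mid + 1))" "$file" | sed -n '2p;2q')
    if [ -z "$line" ] ; then hi=$mid ; continue ; fi
    ts=$(date -d "$(echo "$line" | awk '{print $1, $2, $3}')" +%s)
    if (( ts < target )) ; then lo=$mid ; else hi=$mid ; fi
done
echo "first in-range entry starts near byte offset $hi"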

What I don't know is: how to best read the nth line of a file (how efficient is it to use `tail -n +n | head -1`?)

Any help?

Brent
tail (or head for that matter) needs to count the newline characters which means it'll have to count in from either end to find a middle line. And there's no standard UNIX command that will binary search a file like in your description - you'll have to write your own, in which case you might as well do the whole lot in a single optimized executable.
paxdiablo