I'm writing a log-analysis application and want to grab Apache log records between two given dates. Assume that a date is formatted like this: 22/Dec/2009:00:19 (day/month/year:hour:minute)

Currently, I'm using a regular expression to replace the month name with its numeric value and remove the separators, reordering the fields so the above date is converted to 200912220019 (year month day hour minute), which makes the date comparison a trivial string compare.. but..

Running a regex on each record of a large file, say one containing a quarter of a million records, is extremely costly. Is there any other method that doesn't involve regex substitution?

Thanks in advance

Edit: here's the function doing the conversion/comparison:

# returns true if timestamp t falls within [from, to]
function dateInRange(t, from, to) {
    sub(/[[]/, "", t);                 # strip the leading "[" from the log field
    split(t, a, "[/:]");               # a[1]=day, a[2]=month name, a[3]=year, a[4]=hour, a[5]=minute
    match("JanFebMarAprMayJunJulAugSepOctNovDec", a[2]);
    a[2] = sprintf("%02d", (RSTART + 2) / 3);   # name position -> zero-padded month number
    s = a[3] a[2] a[1] a[4] a[5];      # reorder to YYYYMMDDhhmm

    return s >= from && s <= to;       # plain string comparison
}

"from" and "to" are the intervals in the aforementioned format, and "t" is the raw apache log date/time field (e.g [22/Dec/2009:00:19:36)

+1  A: 

I once had the same problem with a very slow AWK program that involved regular expressions. When I translated the whole program to Perl, it ran much faster. I guess that's because GNU AWK compiles a regular expression every time it evaluates the expression, whereas Perl compiles the expression just once.

Roland Illig
Seconded. AWK is great fun, but Perl is a lot faster. For just analyzing log files, consider one of the special-purpose and *fast* tools made for the job, like Analog.
shavenwarthog
True, Perl is faster, but I chose AWK for convenience: it's older than the sun and ships with every *nix, so I can safely assume it's installed; with Perl I'm not so sure. If I'm distributing an application, I have to go with the lowest common denominator. Besides, Perl is no match for AWK in terms of simplicity and brevity when processing textual data (mind you, my Perl powers are limited).
smallmeans
A: 

Well, here is an idea, assuming the records in the log are ordered by date.

Instead of running a regexp on every line in the file and checking whether that record is within the required range, do a binary search.

Get the total number of lines in the file. Read the line in the middle and check its date. If it is older than your range, then anything before that line can be ignored. Split what's left in half and check the line in the middle again, and so on until you find your range boundaries.

serg
+1  A: 

Here is a Python program I wrote to do a binary search through a log file based on dates. It could be adapted to work for your use case.

It seeks to the middle of the file, syncs to a newline, reads and compares the date, and repeats the process, splitting the previous half in half, until the date matches (greater than or equal). It then rewinds to make sure there are no more lines with the same date right before, and finally reads and outputs lines until the end of the desired range. It's very fast.
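
A minimal sketch of that approach might look like the following (an illustration, not the program referred to above; the file name access.log, the range bounds, and the YYYYMMDDhhmm keys from the question are placeholder assumptions, and the log is assumed to be ordered by date):

import os, sys

# Month name -> zero-padded month number, e.g. b"Dec" -> b"12"
MONTHS = {m: b"%02d" % (i + 1) for i, m in
          enumerate(b"Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split())}

def key(line):
    # b"... [22/Dec/2009:00:19:36 ..." -> b"200912220019"
    # (assumes every record contains the bracketed timestamp)
    t = line.split(b"[", 1)[1]
    day, mon, rest = t.split(b"/", 2)
    year, hour, minute = rest.split(b":", 3)[:3]
    return year + MONTHS[mon] + day + hour + minute

def line_at(f, pos):
    # Seek to pos, then sync to the start of the next complete line.
    f.seek(pos)
    if pos:
        f.readline()               # discard the partial line we landed in
    return f.tell(), f.readline()

def lower_bound(f, target, size):
    # Byte offset of the first record whose timestamp is >= target.
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        line = line_at(f, mid)[1]
        if not line or key(line) >= target:
            hi = mid               # boundary is at or before mid
        else:
            lo = mid + 1           # boundary is after mid
    return line_at(f, lo)[0]

def dump_range(path, date_from, date_to):
    with open(path, "rb") as f:
        f.seek(lower_bound(f, date_from, os.path.getsize(path)))
        for line in f:
            if key(line) > date_to:
                break              # ordered file: nothing later can match
            sys.stdout.buffer.write(line)

dump_range("access.log", b"200912220000", b"200912222359")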

I have a more advanced version in the works. Eventually I'll get it completed and post the updated version.

Dennis Williamson
A: 

Chopping files up just to identify a range sounds a bit heavy-handed for such a simple task (binary search is worth considering, though).

Here's my modified function, which is obviously much faster since the regex matching is eliminated:

BEGIN {
    # build the month-name lookup table once
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ");
    for (i = 1; i <= 12; i++)
        months[m[i]] = i;
}
function dateInRange(t, from, to) {
    t = substr(t, 2);                     # drop the leading "[" without a regex
    split(t, a, "[/:]");
    m = sprintf("%02d", months[a[2]]);    # month name -> zero-padded number
    s = a[3] m a[1] a[4] a[5];            # YYYYMMDDhhmm
    ok = s >= from && s <= to;
    if (ok)
        seen = 1;                         # remember we've entered the range
    else if (seen)
        exit;                             # past the range: stop reading
    return ok;
}

An array is defined once in the BEGIN block and subsequently used to look up month numbers. The function also ensures the program doesn't keep checking records once the date falls past the range: the variable seen is set on the first match, and exit is called on the first miss after that (safe because the log is ordered by date).

Thank you all for your input.

smallmeans