I'm writing a log-analysis application and want to grab Apache log records between two given dates. Assume that a date is formatted like this: 22/Dec/2009:00:19 (day/month/year:hour:minute)

Currently, I'm using a regular expression to replace the month name with its numeric value and remove the separators, reordering the fields so the above date is converted to 200912220019 (year month day hour minute), which makes the date comparison a trivial string compare.. but..

Running a regex on each record of a large file, say one containing a quarter of a million records, is extremely costly. Is there any other method that doesn't involve regex substitution?

Thanks in advance

Edit: here's the function doing the conversion/comparison:

# returns true if timestamp t falls within [from, to]
function dateInRange(t, from, to) {
    sub(/[[]/, "", t);                 # strip the leading "[" from the log field
    split(t, a, "[/:]");               # a[1]=day, a[2]=month name, a[3]=year, a[4]=hour, a[5]=minute
    match("JanFebMarAprMayJunJulAugSepOctNovDec", a[2]);
    a[2] = sprintf("%02d", (RSTART + 2) / 3);   # name position -> zero-padded month number
    s = a[3] a[2] a[1] a[4] a[5];      # reorder to YYYYMMDDhhmm

    return s >= from && s <= to;       # plain string comparison
}

"from" and "to" are the intervals in the aforementioned format, and "t" is the raw apache log date/time field (e.g [22/Dec/2009:00:19:36)

+1  A: 

I once had the same problem with a very slow AWK program that involved regular expressions. When I translated the whole program to Perl, it ran much faster. I guess that's because GNU AWK compiles a regular expression every time it evaluates the expression, whereas Perl compiles the expression just once.

Roland Illig
Seconded. AWK is great fun, but Perl is a lot faster. For just analyzing log files, consider one of the special-purpose and *fast* tools made for the job, like Analog.
shavenwarthog
True, Perl is faster, but I chose AWK for convenience: it's older than the sun and ships with every *nix, so I can safely assume it's installed; with Perl I'm not so sure. If I'm distributing an application, I have to go with the lowest common denominator. Besides, Perl is no match for AWK in terms of simplicity and brevity when processing textual data (mind you, my Perl powers are limited).
smallmeans
A: 

Well, here is an idea, assuming the records in the log are ordered by date.

Instead of running a regexp on every line in the file and checking whether that record is within the required range, do a binary search.

Get the total number of lines in the file. Read the line in the middle and check its date. If it is older than your range, then anything before that line can be ignored. Split what's left in half and check the line in the middle again, and so on until you find your range boundaries.

serg
+1  A: 

Here is a Python program I wrote to do a binary search through a log file based on dates. It could be adapted to work for your use case.

It seeks to the middle of the file, syncs to a newline, reads and compares the date, and repeats the process, splitting the previous half in half, until the date matches (greater than or equal). It then rewinds to make sure there are no more lines with the same date right before, and finally reads and outputs lines until the end of the desired range. It's very fast.
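
A minimal sketch of that approach might look like the following (an illustration, not the program referred to above; the file name access.log, the range bounds, and the YYYYMMDDhhmm keys from the question are placeholder assumptions, and the log is assumed to be ordered by date):

import os, sys

# Month name -> zero-padded month number, e.g. b"Dec" -> b"12"
MONTHS = {m: b"%02d" % (i + 1) for i, m in
          enumerate(b"Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split())}

def key(line):
    # b"... [22/Dec/2009:00:19:36 ..." -> b"200912220019"
    # (assumes every record contains the bracketed timestamp)
    t = line.split(b"[", 1)[1]
    day, mon, rest = t.split(b"/", 2)
    year, hour, minute = rest.split(b":", 3)[:3]
    return year + MONTHS[mon] + day + hour + minute

def line_at(f, pos):
    # Seek to pos, then sync to the start of the next complete line.
    f.seek(pos)
    if pos:
        f.readline()               # discard the partial line we landed in
    return f.tell(), f.readline()

def lower_bound(f, target, size):
    # Byte offset of the first record whose timestamp is >= target.
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        line = line_at(f, mid)[1]
        if not line or key(line) >= target:
            hi = mid               # boundary is at or before mid
        else:
            lo = mid + 1           # boundary is after mid
    return line_at(f, lo)[0]

def dump_range(path, date_from, date_to):
    with open(path, "rb") as f:
        f.seek(lower_bound(f, date_from, os.path.getsize(path)))
        for line in f:
            if key(line) > date_to:
                break              # ordered file: nothing later can match
            sys.stdout.buffer.write(line)

dump_range("access.log", b"200912220000", b"200912222359")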

I have a more advanced version in the works. Eventually I'll get it completed and post the updated version.

Dennis Williamson
A: 

Chopping files up just to identify a range sounds a bit heavy-handed for such a simple task (binary search is worth considering, though).

Here's my modified function, which is obviously much faster since the regex matching is eliminated:

BEGIN {
    # build the month-name lookup table once
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ");
    for (i = 1; i <= 12; i++)
        months[m[i]] = i;
}
function dateInRange(t, from, to) {
    t = substr(t, 2);                     # drop the leading "[" without a regex
    split(t, a, "[/:]");
    m = sprintf("%02d", months[a[2]]);    # month name -> zero-padded number
    s = a[3] m a[1] a[4] a[5];            # YYYYMMDDhhmm
    ok = s >= from && s <= to;
    if (ok)
        seen = 1;                         # remember we've entered the range
    else if (seen)
        exit;                             # past the range: stop reading
    return ok;
}

An array is defined once in the BEGIN block and subsequently used to look up month numbers. The function also ensures the program doesn't keep checking records once the date falls past the range: the variable seen is set on the first match, and exit is called on the first miss after that (safe because the log is ordered by date).

Thank you all for your input.

smallmeans