I have a VPS hosting multiple virtual hosts. Each host has its own access.log and error.log. There's currently no log rotation set up, though this may change.
Basically, I want to parse these logs to monitor bandwidth and collect stats.
My idea was to write a parser and save the information to a small SQLite database. The script will run every 5 minutes and use Python's seek() and tell() methods to resume reading each log file from the last parsed location. This prevents me from parsing a 10GB log file every 5 minutes when all I need is the new information sitting at the end of it (no log rotation, remember?).
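To make the seek/tell idea concrete, here is a minimal sketch of what I have in mind. The function name read_new_lines and the JSON state file holding the offsets are just placeholders I made up for illustration, and the truncation check is an assumption about how I'd cope if rotation gets enabled later.

```python
import json
import os


def read_new_lines(log_path, state_path):
    """Return complete lines appended to log_path since the previous call.

    Byte offsets are kept in a small JSON file (state_path), keyed by log path.
    Both names are hypothetical; any persistent store would do.
    """
    offsets = {}
    if os.path.exists(state_path):
        with open(state_path) as f:
            offsets = json.load(f)
    offset = offsets.get(log_path, 0)

    # Assumption: if the file is now smaller than the saved offset, it was
    # truncated or rotated, so start again from the beginning.
    if os.path.getsize(log_path) < offset:
        offset = 0

    lines = []
    with open(log_path, "rb") as log:
        log.seek(offset)
        while True:
            pos = log.tell()  # position before the next line
            raw = log.readline()
            if not raw.endswith(b"\n"):
                # End of file, or a partial line still being written:
                # stop here and re-read it on the next run.
                break
            lines.append(raw.decode("utf-8", errors="replace").rstrip("\r\n"))
        offset = pos

    offsets[log_path] = offset
    with open(state_path, "w") as f:
        json.dump(offsets, f)
    return lines
```

Opening the file in binary mode keeps the offsets as plain byte counts, so the value from tell() can be passed straight back to seek() on the next run.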
After some thought, I realised that all I'm doing is taking the information from the log files and putting it into a database... Moving the data from one location to another :/
So how else can I do this? I want to be able to do something like:
python logparse.py --show=bandwidth --between-dates=25,05|30,05 --vhost=test.com
This would open the log file for test.com and show me the bandwidth used for the specified 5 days.
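For reference, this is roughly how I'd expect to parse that command line with argparse. It is only a sketch: the helper names are made up, the option set just mirrors the example above, and since the DD,MM format carries no year I'm assuming the caller supplies one. (In a real shell the date argument would need quoting, because | is the pipe character.)

```python
import argparse
from datetime import date


def parse_date_range(text, year):
    """Parse 'DD,MM|DD,MM' into a (start, end) pair of dates.

    The year is not part of the format, so it is passed in (an assumption).
    """
    start_text, end_text = text.split("|")

    def parse_one(part):
        day, month = (int(piece) for piece in part.split(","))
        return date(year, month, day)

    start, end = parse_one(start_text), parse_one(end_text)
    if end < start:
        raise ValueError("end date precedes start date")
    return start, end


def build_parser():
    """Build a parser for the hypothetical logparse.py options shown above."""
    parser = argparse.ArgumentParser(prog="logparse.py")
    parser.add_argument("--show", choices=["bandwidth"], required=True)
    parser.add_argument("--between-dates", required=True, metavar="RANGE")
    parser.add_argument("--vhost", required=True)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    start, end = parse_date_range(args.between_dates, date.today().year)
    print(args.show, args.vhost, start, end)
```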
Now, my question is: how do I avoid parsing 10GB of data when I only want 5 days' worth?
If I were to use my idea of saving the log data to a database every 5 minutes, I could just save a unix timestamp of the dates and pull out the data between them. Easy. But I'd prefer to parse the log file directly.
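For comparison, the database version I'm trying to avoid would look something like the sketch below. The table layout (a hits table with vhost, ts and bytes columns) and the function names are assumptions for illustration only, and timestamps are treated as UTC.

```python
import sqlite3
from datetime import datetime, timezone


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) hits table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits ("
        "vhost TEXT NOT NULL, ts INTEGER NOT NULL, bytes INTEGER NOT NULL)"
    )
    # An index on (vhost, ts) lets the range query avoid a full table scan.
    conn.execute("CREATE INDEX IF NOT EXISTS hits_vhost_ts ON hits (vhost, ts)")
    return conn


def record_hit(conn, vhost, ts, nbytes):
    """Store one parsed log entry; ts is a Unix timestamp in seconds."""
    conn.execute("INSERT INTO hits (vhost, ts, bytes) VALUES (?, ?, ?)",
                 (vhost, ts, nbytes))


def bandwidth_between(conn, vhost, start_ts, end_ts):
    """Sum the bytes served for vhost where start_ts <= ts < end_ts."""
    row = conn.execute(
        "SELECT COALESCE(SUM(bytes), 0) FROM hits "
        "WHERE vhost = ? AND ts >= ? AND ts < ?",
        (vhost, start_ts, end_ts),
    ).fetchone()
    return row[0]


def to_unix(year, month, day):
    """Midnight UTC on the given date as a Unix timestamp."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())
```

With this layout the date-range lookup is a single indexed query, which is exactly what makes the database route easy.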