I have a VPS that's hosting multiple virtual hosts. Each host has its own access.log and error.log. Currently there's no log rotation set up, though this may change.

Basically, I want to parse these logs to monitor bandwidth and collect stats.

My idea was to write a parser and save the information to a small sqlite database. The script will run every 5 minutes and use Python's seek and tell methods to open the log files from the last parsed locations. This prevents me from parsing a 10GB log file every 5 minutes when all I need is the new information sitting at the end of it (no log rotation, remember?).

After some thought, I realised that all I'm doing is taking the information from the log files and putting it into a database... moving the data from one location to another :/

So how else can I do this? I want to be able to do something like:

python logparse.py --show=bandwidth --between-dates=25,05|30,05 --vhost=test.com

This would open the log file for test.com and show me the bandwidth used for the specified 5 days.
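
The front end itself is easy enough to sketch with argparse; something like this is what I have in mind (show_bandwidth() doesn't exist yet, and the | would need to be quoted on the shell since it's the pipe character):

    #!/usr/bin/env python
    # logparse.py -- sketch of the command-line interface only
    import argparse

    def main():
        parser = argparse.ArgumentParser(description="Query per-vhost log statistics")
        parser.add_argument("--show", choices=["bandwidth"], required=True)
        parser.add_argument("--between-dates", help="start|end, e.g. '25,05|30,05'")
        parser.add_argument("--vhost", required=True)
        args = parser.parse_args()

        start, end = args.between_dates.split("|")
        if args.show == "bandwidth":
            show_bandwidth(args.vhost, start, end)   # hypothetical helper

    if __name__ == "__main__":
        main()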

Now, my question is: how do I prevent myself from parsing 10GB worth of data when I only want 5 days' worth?

If I were to use my idea of saving the log data to a database every 5 minutes, I could just save a unix timestamp of the dates and pull out the data between them. Easy. But I'd prefer to parse the log file directly.

A: 

Save the last position

When you have finished parsing a log file, save the position in a database table that references both the full file path and the position. When you run the parser 5 minutes later, query the database for the log you are going to parse, retrieve the position and start from there.
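
Something like this, for example (untested sketch; the table and helper names are made up):

    import sqlite3

    conn = sqlite3.connect("logstats.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS parse_state (
                        path     TEXT PRIMARY KEY,
                        position INTEGER NOT NULL
                    )""")

    def parse_new_lines(path, handle_line):
        """Resume parsing `path` from the last saved offset, then save the new offset."""
        row = conn.execute("SELECT position FROM parse_state WHERE path = ?",
                           (path,)).fetchone()
        position = row[0] if row else 0
        with open(path, "rb") as f:
            f.seek(position)
            while True:
                line = f.readline()
                if not line:
                    break
                handle_line(line)      # your per-line parsing goes here
            conn.execute("INSERT OR REPLACE INTO parse_state (path, position) "
                         "VALUES (?, ?)", (path, f.tell()))
        conn.commit()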

Save the first line of data

When you have log rotation, add an additional key in the database that will contain the first line of the log file. So when you start with a file, first read its first line. When you query the database, check against the first line rather than the file name.

The first line should always be unique, since it contains the timestamp. But don't forget that W3C-compliant log files usually write headers at the beginning of the file, so the first line you store should be the first line of data.
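
Continuing the sketch above, with the parse_state table extended by a first_line column (still illustrative names):

    def saved_position(conn, path):
        """Return the offset to resume from, or 0 if the file is new or was rotated."""
        with open(path) as f:
            current_first = f.readline()
            # Skip W3C-style header lines: the key should be the first line of data.
            while current_first.startswith("#"):
                current_first = f.readline()
        row = conn.execute(
            "SELECT position, first_line FROM parse_state WHERE path = ?",
            (path,)).fetchone()
        if row is None or row[1] != current_first:
            return 0      # unknown file, or its first data line changed: start over
        return row[0]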

Save the data you need only

When parsing W3C logs, it's very easy to read the bytes sent. Parsing will be very fast if you keep only that information. Then store it in your database, either by updating an existing row, or by adding a new row with a timestamp that you can aggregate with others later in a query.
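
For example, if the vhosts use Apache's common/combined log format rather than strict W3C extended format, the bytes-sent field sits right after the status code (this regex is only a sketch under that assumption):

    import re

    # %h %l %u %t "%r" %>s %b ...  -- combined log format; %b is bytes sent or '-'
    LINE_RE = re.compile(r'\[(?P<time>[^\]]+)\] "[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-)')

    def bytes_sent(line):
        """Return the bytes sent for one access-log line, or 0 if unknown."""
        m = LINE_RE.search(line)
        if not m or m.group("bytes") == "-":
            return 0
        return int(m.group("bytes"))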

Don't reinvent the wheel

Unless what you are doing is very specific, I recommend grabbing an open source parser from the web. http://awstats.sourceforge.net/

Pierre 303
+1  A: 

Unless you create separate log files for each day, you have no way other than parsing the whole log on request.

I would still use a database to hold the log data, but at your desired time-unit resolution (eg. hold the bandwidth at a day / hour interval). Another advantage of using a database is that you can make range queries, like the one you give in your example, very easily and quickly. Whenever you have old data that you don't need any more, you can delete it from the database to save space.
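
A sketch of that day-resolution storage and range query with sqlite3 (schema and names are only illustrative):

    import sqlite3

    conn = sqlite3.connect("logstats.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS daily_bandwidth (
                        vhost TEXT NOT NULL,
                        day   TEXT NOT NULL,      -- 'YYYY-MM-DD'
                        bytes INTEGER NOT NULL DEFAULT 0,
                        PRIMARY KEY (vhost, day)
                    )""")

    def add_bytes(vhost, day, n):
        """Add n bytes to the counter for (vhost, day)."""
        conn.execute("INSERT OR IGNORE INTO daily_bandwidth (vhost, day, bytes) "
                     "VALUES (?, ?, 0)", (vhost, day))
        conn.execute("UPDATE daily_bandwidth SET bytes = bytes + ? "
                     "WHERE vhost = ? AND day = ?", (n, vhost, day))

    def bandwidth_between(vhost, start_day, end_day):
        """Total bytes for a vhost between two 'YYYY-MM-DD' dates, inclusive."""
        row = conn.execute("SELECT COALESCE(SUM(bytes), 0) FROM daily_bandwidth "
                           "WHERE vhost = ? AND day BETWEEN ? AND ?",
                           (vhost, start_day, end_day)).fetchone()
        return row[0]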

Also, you don't need to parse the whole file each time. You could monitor writes to the file with the help of pyinotify: whenever a line is written, you update the counters in the database. Or you can store the last position in the file whenever you read from it and read from that position the next time. Be careful when the file is truncated.
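
A minimal pyinotify sketch of that idea (handle_new_lines() is a hypothetical callback that would seek to the saved position and update the counters):

    import pyinotify

    LOGFILE = "/var/log/apache2/test.com-access.log"      # example path

    class LogHandler(pyinotify.ProcessEvent):
        def process_IN_MODIFY(self, event):
            handle_new_lines(event.pathname)   # hypothetical: read from last offset, update DB

    wm = pyinotify.WatchManager()
    wm.add_watch(LOGFILE, pyinotify.IN_MODIFY)
    notifier = pyinotify.Notifier(wm, LogHandler())
    notifier.loop()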

To sum it up:

  • hold your data in the database at day resolution (eg. the bandwidth for each day)
  • use pyinotify to monitor the writes to the log file so that you don't read the whole file over and over again

If you don't want to code your own solution, take a look at Webalizer, AWStats or pick a tool from this list.

EDIT:

WebLog Expert also looks promising. Take a look at one of the reports.

the_void
+1  A: 

Pulling just the required 5 days of data from a large logfile comes down to finding the right starting offset to seek() the file to before you begin parsing.

You could find that position each time using a binary search through the file: seek() to os.stat(filename).st_size / 2, call readline() once (discarding the result) to skip to the end of the current line, then do two more readline()s. If the first of those lines is before your desired starting time, and the second is after it, then your starting offset is tell() - len(second_line). Otherwise, continue with the standard binary search. (I'm ignoring the corner cases where the line you're looking for is the first or last line, or not in the file at all, but those are easy to add.)
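
Here's a sketch of that search, written as the usual "find the first line at or after the start time" binary search; parse_time() is a hypothetical helper that extracts a comparable timestamp from a raw log line, and the corner cases are only partly handled:

    import os

    def find_start_offset(path, start_time, parse_time):
        """Byte offset of the first line whose timestamp is >= start_time,
        assuming the file is in chronological order."""
        size = os.stat(path).st_size
        with open(path, "rb") as f:
            first = f.readline()
            if not first or parse_time(first) >= start_time:
                return 0                   # empty file, or the whole file is in range
            lo, hi = 0, size
            while lo < hi:
                mid = (lo + hi) // 2
                f.seek(mid)
                f.readline()               # skip the (possibly partial) line containing mid
                line = f.readline()
                if not line or parse_time(line) >= start_time:
                    hi = mid               # boundary is at or before mid
                else:
                    lo = mid + 1           # boundary is after mid
            f.seek(lo)
            f.readline()                   # skip the last line that is still before start_time
            return f.tell()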

Once you have your starting offset, you just keep parsing lines from there until you reach one that's newer than the range you're interested in.
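
Continuing the sketch, the forward pass from that offset:

    def lines_between(path, start_time, end_time, parse_time):
        """Yield raw log lines whose timestamps fall inside [start_time, end_time]."""
        offset = find_start_offset(path, start_time, parse_time)
        with open(path, "rb") as f:
            f.seek(offset)
            for line in f:
                if parse_time(line) > end_time:
                    break                  # past the range, stop reading
                yield line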

This will be much faster than parsing the whole logfile each time, of course, but if you're going to be doing a lot of these queries, then a database probably is worth the extra complexity. If the size of the database is a concern, you could go for a hybrid approach where the database is an index into the log file. For example, you could store just the byte offset of the start of each day in the database. If you don't want to update the database every 5 minutes, you could have logparse.py update it with new data each time it runs.
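
One way that hybrid index might look (sqlite3, illustrative names; the parser would call record_day_start() the first time it sees a line for a new day):

    import sqlite3

    def open_day_index(db_path="logstats.db"):
        conn = sqlite3.connect(db_path)
        conn.execute("""CREATE TABLE IF NOT EXISTS day_index (
                            vhost  TEXT NOT NULL,
                            day    TEXT NOT NULL,      -- 'YYYY-MM-DD'
                            offset INTEGER NOT NULL,
                            PRIMARY KEY (vhost, day)
                        )""")
        return conn

    def record_day_start(conn, vhost, day, offset):
        conn.execute("INSERT OR IGNORE INTO day_index (vhost, day, offset) "
                     "VALUES (?, ?, ?)", (vhost, day, offset))

    def day_start_offset(conn, vhost, day):
        """Where to seek() to for a query starting at `day`; fall back to offset 0."""
        row = conn.execute("SELECT offset FROM day_index WHERE vhost = ? AND day = ?",
                           (vhost, day)).fetchone()
        return row[0] if row else 0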

After all that, though, as Pierre and the_void have said, do make sure you're not reinventing the wheel -- you're not the first person ever to need bandwidth statistics :-)

slowdog