I have a VPS that's hosting multiple virtual hosts. Each host has its own access.log and error.log. Currently there's no log rotation set up, though this may change.

Basically, I want to parse these logs to monitor bandwidth and collect stats.

My idea was to write a parser and save the information to a small sqlite database. The script will run every 5 minutes and use Python's seek and tell methods to open the log files from the last parsed locations. This prevents me from parsing a 10GB log file every 5 minutes when all I need is the new information sitting at the end of it (no log rotation, remember?).

After some thought, I realised that all I'm doing is taking the information from the log files and putting it into a database... moving the data from one location to another :/

So how else can I do this? I want to be able to do something like:

python logparse.py --show=bandwidth --between-dates=25,05|30,05 --vhost=test.com

This would open the log file for test.com and show me the bandwidth used for the specified 5 days.
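
The front end itself is easy enough to sketch with argparse; something like this is what I have in mind (show_bandwidth() doesn't exist yet, and the | would need to be quoted on the shell since it's the pipe character):

    #!/usr/bin/env python
    # logparse.py -- sketch of the command-line interface only
    import argparse

    def main():
        parser = argparse.ArgumentParser(description="Query per-vhost log statistics")
        parser.add_argument("--show", choices=["bandwidth"], required=True)
        parser.add_argument("--between-dates", help="start|end, e.g. '25,05|30,05'")
        parser.add_argument("--vhost", required=True)
        args = parser.parse_args()

        start, end = args.between_dates.split("|")
        if args.show == "bandwidth":
            show_bandwidth(args.vhost, start, end)   # hypothetical helper

    if __name__ == "__main__":
        main()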

Now, my question is: how do I prevent myself from parsing 10GB worth of data when I only want 5 days' worth?

If I were to use my idea of saving the log data to a database every 5 minutes, I could just save a unix timestamp of the dates and pull out the data between them. Easy. But I'd prefer to parse the log file directly.

A: 

Save the last position

When you have finished parsing a log file, save the position in a database table that references both the full file path and the position. When you run the parser 5 minutes later, query the database for the log you are going to parse, retrieve the position and start from there.
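
Something like this, for example (untested sketch; the table and helper names are made up):

    import sqlite3

    conn = sqlite3.connect("logstats.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS parse_state (
                        path     TEXT PRIMARY KEY,
                        position INTEGER NOT NULL
                    )""")

    def parse_new_lines(path, handle_line):
        """Resume parsing `path` from the last saved offset, then save the new offset."""
        row = conn.execute("SELECT position FROM parse_state WHERE path = ?",
                           (path,)).fetchone()
        position = row[0] if row else 0
        with open(path, "rb") as f:
            f.seek(position)
            while True:
                line = f.readline()
                if not line:
                    break
                handle_line(line)      # your per-line parsing goes here
            conn.execute("INSERT OR REPLACE INTO parse_state (path, position) "
                         "VALUES (?, ?)", (path, f.tell()))
        conn.commit()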

Save the first line of data

When you have log rotation, add an additional key in the database that will contain the first line of the log file. So when you start with a file, first read its first line. When you query the database, check against the first line rather than the file name.

The first line should always be unique, since it contains the timestamp. But don't forget that W3C-compliant log files usually write headers at the beginning of the file, so the first line you store should be the first line of data.
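
Continuing the sketch above, with the parse_state table extended by a first_line column (still illustrative names):

    def saved_position(conn, path):
        """Return the offset to resume from, or 0 if the file is new or was rotated."""
        with open(path) as f:
            current_first = f.readline()
            # Skip W3C-style header lines: the key should be the first line of data.
            while current_first.startswith("#"):
                current_first = f.readline()
        row = conn.execute(
            "SELECT position, first_line FROM parse_state WHERE path = ?",
            (path,)).fetchone()
        if row is None or row[1] != current_first:
            return 0      # unknown file, or its first data line changed: start over
        return row[0]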

Save the data you need only

When parsing W3C logs, it's very easy to read the bytes sent. Parsing will be very fast if you keep only that information. Then store it in your database, either by updating an existing row, or by adding a new row with a timestamp that you can aggregate with others later in a query.
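
For example, if the vhosts use Apache's common/combined log format rather than strict W3C extended format, the bytes-sent field sits right after the status code (this regex is only a sketch under that assumption):

    import re

    # %h %l %u %t "%r" %>s %b ...  -- combined log format; %b is bytes sent or '-'
    LINE_RE = re.compile(r'\[(?P<time>[^\]]+)\] "[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-)')

    def bytes_sent(line):
        """Return the bytes sent for one access-log line, or 0 if unknown."""
        m = LINE_RE.search(line)
        if not m or m.group("bytes") == "-":
            return 0
        return int(m.group("bytes"))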

Don't reinvent the wheel

Unless what you are doing is very specific, I recommend grabbing an open source parser from the web. http://awstats.sourceforge.net/

Pierre 303
+1  A: 

Unless you create separate log files for each day, you have no way other than parsing the whole log on request.

I would still use a database to hold the log data, but at your desired time-unit resolution (eg. hold the bandwidth at a day / hour interval). Another advantage of using a database is that you can make range queries, like the one you give in your example, very easily and quickly. Whenever you have old data that you don't need any more, you can delete it from the database to save space.
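
A sketch of that day-resolution storage and range query with sqlite3 (schema and names are only illustrative):

    import sqlite3

    conn = sqlite3.connect("logstats.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS daily_bandwidth (
                        vhost TEXT NOT NULL,
                        day   TEXT NOT NULL,      -- 'YYYY-MM-DD'
                        bytes INTEGER NOT NULL DEFAULT 0,
                        PRIMARY KEY (vhost, day)
                    )""")

    def add_bytes(vhost, day, n):
        """Add n bytes to the counter for (vhost, day)."""
        conn.execute("INSERT OR IGNORE INTO daily_bandwidth (vhost, day, bytes) "
                     "VALUES (?, ?, 0)", (vhost, day))
        conn.execute("UPDATE daily_bandwidth SET bytes = bytes + ? "
                     "WHERE vhost = ? AND day = ?", (n, vhost, day))

    def bandwidth_between(vhost, start_day, end_day):
        """Total bytes for a vhost between two 'YYYY-MM-DD' dates, inclusive."""
        row = conn.execute("SELECT COALESCE(SUM(bytes), 0) FROM daily_bandwidth "
                           "WHERE vhost = ? AND day BETWEEN ? AND ?",
                           (vhost, start_day, end_day)).fetchone()
        return row[0]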

Also, you don't need to parse the whole file each time. You could monitor writes to the file with the help of pyinotify: whenever a line is written, you update the counters in the database. Or you can store the last position in the file whenever you read from it and read from that position the next time. Be careful when the file is truncated.
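
A minimal pyinotify sketch of that idea (handle_new_lines() is a hypothetical callback that would seek to the saved position and update the counters):

    import pyinotify

    LOGFILE = "/var/log/apache2/test.com-access.log"      # example path

    class LogHandler(pyinotify.ProcessEvent):
        def process_IN_MODIFY(self, event):
            handle_new_lines(event.pathname)   # hypothetical: read from last offset, update DB

    wm = pyinotify.WatchManager()
    wm.add_watch(LOGFILE, pyinotify.IN_MODIFY)
    notifier = pyinotify.Notifier(wm, LogHandler())
    notifier.loop()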

To sum it up:

  • hold your data in the database at day resolution (eg. the bandwidth for each day)
  • use pyinotify to monitor the writes to the log file so that you don't read the whole file over and over again

If you don't want to code your own solution, take a look at Webalizer, AWStats or pick a tool from this list.

EDIT:

WebLog Expert also looks promising. Take a look at one of the reports.

the_void
+1  A: 

Pulling just the required 5 days of data from a large logfile comes down to finding the right starting offset to seek() the file to before you begin parsing.

You could find that position each time using a binary search through the file: seek() to os.stat(filename).st_size / 2, call readline() once (discarding the result) to skip to the end of the current line, then do two more readline()s. If the first of those lines is before your desired starting time, and the second is after it, then your starting offset is tell() - len(second_line). Otherwise, continue with the standard binary search. (I'm ignoring the corner cases where the line you're looking for is the first or last line, or not in the file at all, but those are easy to add.)
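
Here's a sketch of that search, written as the usual "find the first line at or after the start time" binary search; parse_time() is a hypothetical helper that extracts a comparable timestamp from a raw log line, and the corner cases are only partly handled:

    import os

    def find_start_offset(path, start_time, parse_time):
        """Byte offset of the first line whose timestamp is >= start_time,
        assuming the file is in chronological order."""
        size = os.stat(path).st_size
        with open(path, "rb") as f:
            first = f.readline()
            if not first or parse_time(first) >= start_time:
                return 0                   # empty file, or the whole file is in range
            lo, hi = 0, size
            while lo < hi:
                mid = (lo + hi) // 2
                f.seek(mid)
                f.readline()               # skip the (possibly partial) line containing mid
                line = f.readline()
                if not line or parse_time(line) >= start_time:
                    hi = mid               # boundary is at or before mid
                else:
                    lo = mid + 1           # boundary is after mid
            f.seek(lo)
            f.readline()                   # skip the last line that is still before start_time
            return f.tell()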

Once you have your starting offset, you just keep parsing lines from there until you reach one that's newer than the range you're interested in.
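
Continuing the sketch, the forward pass from that offset:

    def lines_between(path, start_time, end_time, parse_time):
        """Yield raw log lines whose timestamps fall inside [start_time, end_time]."""
        offset = find_start_offset(path, start_time, parse_time)
        with open(path, "rb") as f:
            f.seek(offset)
            for line in f:
                if parse_time(line) > end_time:
                    break                  # past the range, stop reading
                yield line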

This will be much faster than parsing the whole logfile each time, of course, but if you're going to be doing a lot of these queries, then a database probably is worth the extra complexity. If the size of the database is a concern, you could go for a hybrid approach where the database is an index into the log file. For example, you could store just the byte offset of the start of each day in the database. If you don't want to update the database every 5 minutes, you could have logparse.py update it with new data each time it runs.
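
One way that hybrid index might look (sqlite3, illustrative names; the parser would call record_day_start() the first time it sees a line for a new day):

    import sqlite3

    def open_day_index(db_path="logstats.db"):
        conn = sqlite3.connect(db_path)
        conn.execute("""CREATE TABLE IF NOT EXISTS day_index (
                            vhost  TEXT NOT NULL,
                            day    TEXT NOT NULL,      -- 'YYYY-MM-DD'
                            offset INTEGER NOT NULL,
                            PRIMARY KEY (vhost, day)
                        )""")
        return conn

    def record_day_start(conn, vhost, day, offset):
        conn.execute("INSERT OR IGNORE INTO day_index (vhost, day, offset) "
                     "VALUES (?, ?, ?)", (vhost, day, offset))

    def day_start_offset(conn, vhost, day):
        """Where to seek() to for a query starting at `day`; fall back to offset 0."""
        row = conn.execute("SELECT offset FROM day_index WHERE vhost = ? AND day = ?",
                           (vhost, day)).fetchone()
        return row[0] if row else 0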

After all that, though, as Pierre and the_void have said, do make sure you're not reinventing the wheel -- you're not the first person ever to need bandwidth statistics :-)

slowdog