So, I want to put the past year or so of Tomcat Combined Format files into a database. There are zillions and zillions of hits.

(The plan is to run bespoke and ad hoc queries against it, and match up with some other data. We have some questions that existing log analysis products out there cannot answer for us.)

What I am looking for is...

  1. a robust tool to cleanly import my file into the database
  2. an existing (typed, thought out, bug free) schema to store and structure the data

I'm half a step from rolling my own, but this seems like something that has been done before--zillions of times.

+1  A: 

I would just write the script..

It might have been written countless times before, but I doubt it will have been written for the right database, or for your specific log configuration. (I'm not sure about the W3C Extended Log Format, but many other formats let you define a custom layout.)

Looking at the log format doc, it should be pretty trivial to take each field and create a column in the DB for it.

Then, to parse the example log from the log-format doc:

#Version: 1.0
#Date: 12-Jan-1996 00:00:00
#Fields: time cs-method cs-uri
00:34:23 GET /foo/bar.html
12:21:16 GET /foo/bar.html
12:45:52 GET /foo/bar.html
12:57:34 GET /foo/bar.html

...the following script, which only took a few minutes to write, will work fine:

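import re
import sys

# Matches lines such as "00:34:23 GET /foo/bar.html"
mr = re.compile(r"^(\d\d:\d\d:\d\d) ([A-Z]+) (.+)$")

def insert_into_database(time, rtype, uri):
    # Placeholder: prints the values instead of running a real INSERT
    print("INSERT INTO database (%s, %s, %s)" % (time, rtype, uri))

# Iterate over the file directly so it is never loaded into memory at once
with open("logfile.log") as logfile:
    for line in logfile:
        m = mr.match(line)
        if not m:
            # Header directives ("#Version", "#Fields", ...) also end up here
            sys.stderr.write("Invalid line: %s\n" % line.strip())
        else:
            insert_into_database(m.group(1), m.group(2), m.group(3))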

May not be the most robust/reliable script ever, but it works (well, aside from the insert_into_database function!)
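For the combined format the question actually asks about, the same idea could be sketched roughly as below. This is only a sketch under assumptions: it assumes the usual combined layout (%h %l %u %t "%r" %s %b "%{Referer}i" "%{User-Agent}i"), uses sqlite3 purely as a stand-in database, and the names parse_line, load_lines and the hits table are made up for illustration. It also does not handle escaped quotes inside the request, referer or user-agent fields.

```python
import re
import sqlite3
from datetime import datetime

# Assumed layout: %h %l %u %t "%r" %s %b "%{Referer}i" "%{User-Agent}i"
COMBINED = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$'
)

def parse_line(line):
    """Return a tuple of typed fields for one log line, or None if it does not match."""
    m = COMBINED.match(line.strip())
    if not m:
        return None
    host, _ident, user, ts, request, status, size, referer, agent = m.groups()
    # Timestamps look like 10/Oct/2000:13:55:36 -0700 (%b depends on the locale)
    when = datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z")
    nbytes = None if size == "-" else int(size)
    return (host, user, when.isoformat(), request, int(status), nbytes, referer, agent)

def load_lines(conn, lines):
    """Insert every parseable line into a hypothetical hits table; return (good, bad)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits ("
        "host TEXT, remote_user TEXT, ts TEXT, request TEXT, "
        "status INTEGER, bytes INTEGER, referer TEXT, agent TEXT)"
    )
    counts = {"good": 0, "bad": 0}

    def rows():
        # Generator, so a huge log file is streamed rather than held in memory
        for line in lines:
            row = parse_line(line)
            if row is None:
                counts["bad"] += 1
            else:
                counts["good"] += 1
                yield row

    # Parameterized INSERT: values are bound by the driver, never spliced into SQL
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows())
    conn.commit()
    return counts["good"], counts["bad"]
```

Something like load_lines(sqlite3.connect("hits.db"), open("access.log")) would then stream a whole file in. The bound parameters are the main difference from the print-based placeholder above: they avoid quoting problems with whatever ends up in the URI or user-agent.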

dbr
http://www.w3.org/TR/WD-logfile is the first Google result for "W3C extended", and dbr's code handles the example on that page.
Osama ALASSIRY
I stand corrected, and have edited the question to say what I actually want. But there is still no typed database schema...
Stu Thompson
+1  A: 

This should start you off in the right direction:

Writing Apache's Logs to MySQL http://onlamp.com/pub/a/apache/2005/02/10/database%5Flogs.html

Pretty easy to adapt to another database, or customize the schema. There isn't much to the schema really - just a plain table will do with the appropriate fields and indexes for searching efficiently.
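As a rough illustration of "a plain table with the appropriate fields and indexes", here is one possible layout, written against sqlite3 only so that it runs anywhere. Every table, column and index name here is an assumption made for the example, not taken from the linked article, and the types are SQLite's loose ones; on another database you would pick sized types to suit your data.

```python
import sqlite3

# Hypothetical single-table layout for combined-format hits.
SCHEMA = """
CREATE TABLE IF NOT EXISTS access_log (
    id          INTEGER PRIMARY KEY,
    remote_host TEXT    NOT NULL,
    remote_user TEXT,
    ts_utc      INTEGER NOT NULL,  -- request time as Unix epoch seconds (assumed)
    method      TEXT,
    path        TEXT,
    protocol    TEXT,
    status      INTEGER NOT NULL,
    bytes_sent  INTEGER,           -- NULL when the log shows "-"
    referer     TEXT,
    user_agent  TEXT
);
-- Indexes for the kinds of lookups ad hoc queries tend to need
CREATE INDEX IF NOT EXISTS idx_access_log_ts        ON access_log (ts_utc);
CREATE INDEX IF NOT EXISTS idx_access_log_status_ts ON access_log (status, ts_utc);
CREATE INDEX IF NOT EXISTS idx_access_log_path      ON access_log (path);
"""

def create_schema(conn):
    """Create the table and its indexes if they do not exist yet."""
    conn.executescript(SCHEMA)
```

Splitting the request line into method, path and protocol makes per-URL queries cheaper than searching one free-text column. For a bulk load of a year of logs it is usually faster to create the indexes after the import rather than before.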

Kristoffon
Apache Tomcat is not Apache httpd. Big difference. And, even if it is a single table schema, the precise data types and sizes are important when talking about hundreds of millions of rows.
Stu Thompson
@Stu: Tomcat's "combined" format is the same as the httpd "combined" format.
Stobor
Point is, this page is very much about using mod_log_sql, which won't work in Tomcat. Unless you are only suggesting that I pay attention to the schema, but from your answer it does not seem that way.
Stu Thompson