I am parsing a log and inserting it into either MySQL or SQLite using SQLAlchemy and Python. Right now I open a connection to the DB, and as I loop over each line, I insert it after it is parsed (this is just one big table right now; I'm not very experienced with SQL). I then close the connection when the loop is done. The summarized code is:

log_table = schema.Table('log_table', metadata,
                         schema.Column('id', types.Integer, primary_key=True),
                         schema.Column('time', types.DateTime),
                         schema.Column('ip', types.String(length=15))
....
engine = create_engine(...)
metadata.bind = engine
connection = engine.connect()
....
for line in file_to_parse:
    m = line_regex.match(line)
    if m:
        fields = m.groupdict()
        pythonified = pythoninfy_log(fields)  # Turn them into ints, datetimes, etc.
        if use_sql:
            ins = log_table.insert(values=pythonified)
            connection.execute(ins)
            parsed += 1

My two questions are:

  • Is there a way to speed up the inserts within this basic framework? Maybe have a Queue of inserts and some insertion threads, some sort of bulk inserts, etc?
  • When I used MySQL, the insert time for ~1.2 million records was 15 minutes. With SQLite, the insert time was a little over an hour. Does that time difference between the database engines seem about right, or does it mean I am doing something very wrong?
+2  A: 

Without knowing the table engine (MyISAM? InnoDB?), schema, and indexes, it's hard to comment on the specific differences between the two databases you're using.

However, when using MySQL like this, you will likely find that it is far faster to write your data out to a temporary text file and then use the LOAD DATA INFILE syntax to load it all into your database. It looks like you can call the execute method on your connection object to run the SQL necessary to do this.
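
A rough sketch of that approach (not the poster's code; it assumes pythoninfy_log returns a dict with 'time' and 'ip' keys, that the dump path is writable, and that LOCAL infile is enabled on both the MySQL client and server):

import csv

dump_path = '/tmp/parsed_log.csv'  # hypothetical location for the intermediate file

# 1. Write the parsed rows out to a plain CSV file.
with open(dump_path, 'w') as f:
    writer = csv.writer(f)
    for line in file_to_parse:
        m = line_regex.match(line)
        if m:
            pythonified = pythoninfy_log(m.groupdict())
            writer.writerow([pythonified['time'], pythonified['ip']])

# 2. Hand the whole file to MySQL in a single statement.
connection.execute(
    "LOAD DATA LOCAL INFILE '%s' INTO TABLE log_table "
    "FIELDS TERMINATED BY ',' "
    "(time, ip)" % dump_path
)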

Further, if you are dead set on adding things row by row, and you're recreating the table every time, you can verify key constraints in your program and add those constraints only after all rows have been inserted, saving the DB the time of doing constraints checks on every insert.
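
For example (hypothetical, since the question's schema only has a primary key: suppose you wanted a unique index on ip; the index name ix_log_ip is made up), you could enforce uniqueness in Python during the load and only build the index afterwards, using the same connection.execute style as above:

seen_ips = set()
rows = []
for line in file_to_parse:
    m = line_regex.match(line)
    if m:
        pythonified = pythoninfy_log(m.groupdict())
        # Enforce the "unique ip" rule in Python instead of per-row in the DB.
        if pythonified['ip'] not in seen_ips:
            seen_ips.add(pythonified['ip'])
            rows.append(pythonified)

connection.execute(log_table.insert(), rows)  # one executemany-style bulk insert

# Only now ask the database to build (and from here on enforce) the index.
connection.execute("CREATE UNIQUE INDEX ix_log_ip ON log_table (ip)")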

Conspicuous Compiler
"you can verify key constraints in your program and add those constraints only after all rows have been inserted, saving the DB the time of doing constraints checks on every insert." Can you break that down a little for me, that part went over my head :-P
Kyle Brandt
@Kyle It's hard to give specifics without a table schema to work from. But, for example, if you have any UNIQUE indexes, that uniqueness is a constraint on the table. Every time you insert a row, the database ensures that there isn't another row that conflicts with it. Since you have only one table, you don't have to worry about foreign key constraints, but if you add those later, this would apply to them as well.
Conspicuous Compiler
+2  A: 

The big thing you should try is putting a transaction around multiple inserts, since it is the commit to disk that really takes the time. You'll need to decide on the batching level, but a crude first attempt would be to wrap a single transaction around the whole lot.
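
A minimal sketch of that idea, wrapping the question's existing loop in a single Connection-level transaction (connection.begin() is the plain SQLAlchemy Connection API; committing every N rows instead of once at the end is a straightforward refinement):

trans = connection.begin()  # one transaction around the whole load
try:
    for line in file_to_parse:
        m = line_regex.match(line)
        if m:
            pythonified = pythoninfy_log(m.groupdict())
            connection.execute(log_table.insert(values=pythonified))
    trans.commit()   # a single commit, so the expensive flush to disk happens once
except Exception:
    trans.rollback()
    raise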

Donal Fellows
So something like an array of the ins objects I create, and then execute when the array is full? Or is that not really what you mean?
Kyle Brandt
@Kyle: You'll need to create a transaction with transaction = session.create_transaction(), and then later call transaction.commit(). See http://www.rmunn.com/sqlalchemy-tutorial/tutorial.html and scroll down to the "Transactions" heading.
Conspicuous Compiler
Not exactly what I did, but it was the idea behind what I did, and I like to give people points :-). I explained what I did below.
Kyle Brandt
A: 

I did the following to achieve some batching:

inserts = []
insert_every = 1000
parsed = 0
for line in file_to_parse:
    m = line_regex.match(line)
    if m:
        fields = m.groupdict()
        pythonified = pythoninfy_log(fields)  # turn them into ints, datetimes, etc.
        if use_sql:  # This uses globals, ick :-/
            inserts.append(pythonified)
            if (parsed % insert_every) == 0:
                # Hand SQLAlchemy the whole batch at once (executemany style).
                connection.execute(log_table.insert(), inserts)
                inserts = []
            parsed += 1
if use_sql and len(inserts) > 0:
    # Flush whatever is left over after the loop.
    connection.execute(log_table.insert(), inserts)

This doesn't use transactions, but in a very lazy manner it let me turn the insert/parse stage from ~13 seconds down to ~2 seconds with the MySQL backend on a smaller sample. I will see what the difference between MySQL and SQLite is with this change using the full sample.

I found the basic information for this here.

Results (insert times in minutes):

Engine    Non-Grouped Insert    Grouped Insert
SQLite    61                    8
MySQL     15                    2.5

I didn't flush the filesystem cache between the MySQL and SQLite runs, so the source text file may still have been cached for the second run, but I don't think that makes a significant difference.

Kyle Brandt