ansaurus

Question

How to use SQLAlchemy to dump an SQL file from query expressions to bulk-insert into a DBMS?

Answer 1

+1 A:

Ordinarily, no, there's no way to get the query with the values included.

What database are you using though? Cause a lot of databases do have some bulk load feature for CSV available.

Postgres: http://www.postgresql.org/docs/8.4/static/sql-copy.html
MySQL: http://dev.mysql.com/doc/refman/5.1/en/load-data.html
Oracle: http://www.orafaq.com/wiki/SQL*Loader_FAQ

If you're willing to accept that certain values might not be escaped correctly than you can use this hack I wrote for debugging purposes:

'''Replace the parameter placeholders with values'''
params = compiler.params.items()
params.sort(key=lambda (k, v): len(str(k)), reverse=True)
for k, v in params:
    '''Some types don't need escaping'''
    if isinstance(v, (int, long, float, bool)):
        v = unicode(v)
    else:
        v = "'%s'" % v

    '''Replace the placeholders with values
    Works both with :1 and %(foo)s type placeholders'''
    query = query.replace(':%s' % k, v)
    query = query.replace('%%(%s)s' % k, v)

WoLpH 2010-06-12 15:21:44

Answer 2

A:

I would venture to say the time spent in the python script is in the per-record upload portion. To determine this you could write to CSV or discard the results instead of uploading new records. This will determine where the bottleneck is; at least from a lookup-vs-insert standpoint. If, as I suspect, that is indeed where it is you can take advantage of the bulk import feature most DBS have. There is no reason, and indeed some arguments against, inserting record-by-record in this kind of circumstance.

Bulk imports tend to do some interestng optimization such as doing it as one transaction w/o commits for each record (even just doing this could see an appreciable drop in run time); whenever feasible I recommend the bulk insert for large record counts. You could still use the producer/consumer approach, but have the consumer instead store the values in memory or in a file and then call the bulk import statement specific to the DB you are using. This might be the route to go if you need to do processing for each record in the CSV file. If so I would also consider how much of that can be cached and shared between records.

it is also possible that the bottleneck is using SQLAlchemy. Not that I know of any inherent issues, but given what you are doing it might be requiring a lot more processing than is necessary - as evidenced by the 8x difference in run times.

For fun, since you already know the SQL, try using a direct DBAPI module in Python to do it and compare run times.

The Real Bill 2010-07-07 16:21:25

Answer 3

+1 A:

First, unless you actually have a machine with 50 CPU cores, using 50 threads/processes won't help performance -- it will actually make things slower.

Second, I've a feeling that if you used SQLAlchemy's way of inserting multiple values at once, it would be much faster than creating ORM objects and persisting them one-by-one.

Marius Gedminas 2010-07-10 12:16:30

ansaurus

tags:

views:

answers:

How to use SQLAlchemy to dump an SQL file from query expressions to bulk-insert into a DBMS?

related questions