views:

121

answers:

1

Is there any way to get sqlalchemy to do a bulk insert rather than inserting each individual object. i.e.,

doing:

INSERT INTO `foo` (`bar`) VALUES (1), (2), (3)

rather than:

INSERT INTO `foo` (`bar`) VALUES (1)
INSERT INTO `foo` (`bar`) VALUES (2)
INSERT INTO `foo` (`bar`) VALUES (3)

I've just converted some code to use sqlalchemy rather than raw sql and although it is now much nicer to work with it seems to be slower now (up to a factor of 10), I'm wondering if this is the reason.

May be I could improve the situation using sessions more efficiently. At the moment I have autoCommit=False and do a session.commit() after I've added some stuff. Although this seems to cause the data to go stale if the DB is changed elsewhere, like even if I do a new query I still get old results back?

Thanks for your help!

+2  A: 

As far as I know, there is no way to get the ORM to issue bulk inserts. I believe the underlying reason is that SQLAlchemy needs to keep track of each object's identity (i.e., new primary keys), and bulk inserts interfere with that. For example, assuming your foo table contains an id column and is mapped to a Foo class:

x = Foo(bar=1)
print x.id
# None
session.add(x)
session.flush()
# BEGIN
# INSERT INTO foo (bar) VALUES(1)
# COMMIT
print x.id
# 1

Since SQLAlchemy picked up the value for x.id without issuing another query, we can infer that it got the value directly from the INSERT statement. If you don't need subsequent access to the created objects via the same instances, you can skip the ORM layer for your insert:

Foo.__table__.insert().execute([{'bar': 1}, {'bar': 2}, {'bar': 3}])
# INSERT INTO foo (bar) VALUES ((1,), (2,), (3,))

SQLAlchemy can't match these new rows with any existing objects, so you'll have to query them anew to for any subsequent operations.

As far as stale data is concerned, it's helpful to remember that the session has no built-in way to know when the database is changed outside of the session. In order to access externally modified data through existing instances, the instances must be marked as expired. This happens by default on session.commit(), but can be done manually by calling session.expire_all() or session.expire(instance). An example (SQL omitted):

x = Foo(bar=1)
session.add(x)
session.commit()
print x.bar
# 1
foo.update().execute(bar=42)
print x.bar
# 1
session.expire(x)
print x.bar
# 42

session.commit() expires x, so the first print statement implicitly opens a new transaction and re-queries x's attributes. If you comment out the first print statement, you'll notice that the second one now picks up the correct value, because the new query isn't emitted until after the update.

This makes sense from the point of view of transactional isolation - you should only pick up external modifications between transactions. If this is causing you trouble, I'd suggest clarifying or re-thinking your application's transaction boundaries instead of immediately reaching for session.expire_all().

dhaffey
Thanks for your reply, I'm going to give that a go. WRT the expiring issue, what I saw wasn't quite the same. I am using a scoped session in turbogears. Performing a getSession().query(Foo).filter....all() returned different things depending on the request, also did not return the updated records that were in the db until I restarted it. I fixed this issue by doing a autocommit=True and adding in something that .remove()d the session after the request was complete (I gather you are meant to do that anyway).
I guess it returned different things depending on the request because it had a scoped session per thread in the pool and the sessions were in different states? It seemed a bit odd that sa wouldn't get new data after a fresh request though. I expect I am missunderstanding what autocommit=False is doing
With `autocommit=False`, I believe you should be calling `session.commit()` upon request completion (I'm not familiar with TurboGears, so ignore this if that's handled for you at the framework level). Besides making sure your changes have made it to the database, this would expire everything in the session. The next transaction wouldn't begin until the next use of that session, so future requests on the same thread wouldn't see stale data.
dhaffey