views:

12

answers:

1

I've got a query that returns a fair number of rows, and have found that

  • We wind up throwing away most of the associated ORM instances; and
  • building up those soon-to-be-thrown-away instances is pretty slow.

So I'd like to build only the instances that I need!

Unfortunately, I can't do this by simply restricting the query; I need to do a fair bit of "business logic" processing on each row before I can tell if I'll throw it out; I can't do this in SQL.

So I was thinking that I could use a MapperExtension to handle this: I'd subclass MapperExtension, and then override create_instance; that method would examine the row data, and either return EXT_CONTINUE if the data is worth building into an instance, or ... something else (I haven't yet decided what) otherwise.

Firstly, does this approach even make sense?

Secondly, if it does make sense, I haven't figured out how to find the data I need in the arguments that get passed to create_instance. I suspect it's in there somewhere, but it's hard to find ... instead of getting a row that directly corresponds to the particular class I'm interested in, I'm getting a row that corresponds to the query that SQLalchemy generated, which happens to be a somewhat complex join between (say) tables A, B, and C.

The problem is that I don't know which elements of the row correspond to the fields in my ORM class: I want to be able to pluck out (e.g.) A.id, B.weight, and C.height.

I assume that somewhere inside the mapper, selectcontext, or class_ arguments is some sort of mapping between columns of my table, and offsets into the row. But I haven't yet found just the right thing. I've come tantalizingly close, though. For example, I've found that selectcontext.statement.columns contains the names of the generated columns ... but not those of the table I'm interested in. For example:

Column(u'A_id', UUID(), ...
...
Column(u'%(32285328 B)s_weight, MSInt(), ...
...
Column(u'%(32285999 C)s_height', MSInt(), ...

So: how do I map column names like C.height to offsets into the row?

+1  A: 

The row accepts Column objects as indexes:

row[MyClass.some_element.__clause_element__()]

but that will only get you as far as the classes and aliased() constructs you have access to on the outside. Its very likely that would be all you'd need for that part of the issue (even though ultimately the idea won't work, read on).

If your statement has had subqueries wrapped around it, from using things like from_self() or join() to a polymorphic target, the create_instance() method doesn't give you access to the translation functions you'd need to accomplish that.

If you're trying to get at rows that are linked to an eagerload(), that's totally not something you should be doing. eagerload() is about optimizing the load of collections. If you want your query to join between two tables and you're looking to filter on the joined table, use join().

But above all, create_instance() is from version 0.1 of SQLAlchemy and I doubt anyone uses it for anything, and it has no capability to say, "skip this row". It has to return something or the mapper will create the instance on its own. So no matter how well you can interpret the row, there's no hook for what you want to do here.

If i really wanted to do such a thing, it would likely be easier to monkeypatch the "fetchall()" method of the returned ResultProxy to filter rows, and send it to Query.instances(). Any result can be sent to this method. Although, if the Query has done translations and such on the mapped selectables, it would need the original QueryContext as well to know how to translate. But this is nothing I'd be bothering with either.

Overall, if speed is so critical of an issue throughout all of this that creating the object is that big of a difference, I'd make it so that I don't need the mapped objects at all for the whole operation, or I'd use caching, or generate the objects I need manually from a result set. I also would make sure that I have access to all the targeted columns in the selectable I'm using so I can re-fetch from result rows, which means I either don't use automatic-subquery/alias generation functions in the ORM, or I use the expression language directly (if you're really hungry for speed and are in the mood to write large tracts of optimizing code, you should probably just be using the expression language).

So the real questions you have to ask here are:

  1. Have you verified that the real difference in speed is creating the object from the row. I.e. not fetching the row, or fetching its columns, etc.
  2. Does the row just have some expensive columns that you don't need? Have you looked into deferred() ?
  3. What are these business rules and why cant they be done in SQL, as stored procedures, etc.
  4. How many thousands of rows are you really skipping here, that its so "slow" to not "skip" them
  5. Have you investigated techniques for having the objects already present, like in-memory caches, preloads, etc. For many scenarios, this fits the bill.
  6. None of this works, and you really want to hack up some home-rolled optimization code. So why not use the SQL expression language directly? If ultimately you're just dealing with a view layer, result rows are quite friendly (they allow "attribute" style access and such), or build some quick "generate an object" routine from it. The ORM presents a very specific use case of the SQL expression language, and if you really need something much more lightweight than it, you're better off skipping it.
zzzeek