views: 497
answers: 4
I'm trying to optimize some Python code. The profiler tells me that SQLAlchemy's _get_col() is what's killing performance. The code looks something like this:

lots_of_rows = get_lots_of_rows()
for row in lots_of_rows:
    if row.x == row.y:
        print row.z

I was about to go through the code and make it more like this...

lots_of_rows = get_lots_of_rows()
for row in lots_of_rows:
    if row[0] == row[1]:
        print row[2]

...but I've found some documentation that seems to indicate that when you index row objects like arrays, you're actually still doing a dictionary key lookup. In other words, the row object looks like this:

'x': (x object)
'0': (x object)
'y': (y object)
'1': (y object)
'z': (z object)
'2': (z object)

If that's the case, I doubt I'll see any performance improvement from accessing columns by number rather than name. Is there any way to get SA to return results as a list of tuples, or a list of lists, rather than a list of dictionaries? Alternately, can anyone suggest any other optimizations?
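
Before converting everything, I figured I could at least time the two access styles on the same data. Here's a rough sketch, assuming get_lots_of_rows() returns a list of rows with the columns named as above:

import timeit

rows = get_lots_of_rows()          # placeholder, as above
sample = rows[:1000]               # keep the timing run short

# Compare attribute access (row.x) against positional access (row[0]).
by_name = timeit.timeit(lambda: [r.x for r in sample], number=100)
by_index = timeit.timeit(lambda: [r[0] for r in sample], number=100)
print(by_name)
print(by_index)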

A: 

I think row.items() is what you're looking for. It returns a list of (key, value) tuples for the row.


DNS
Sadly, items() is just this: return [(key, getattr(self, key)) for key in self.keys()] ... and __getattr__() calls _get_col(). So it's actually even slower that way.
mike
+1  A: 

Forgive the obvious answer, but why isn't row.x == row.y in your query? For example:

mytable.select().where(mytable.c.x==mytable.c.y)

That should give you a huge performance boost. Read the rest of the documentation.
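
For instance, here's a rough sketch of pushing the comparison into the SELECT, using the engine.execute() style from this era of SQLAlchemy; the table, columns and engine URL are made-up placeholders:

from sqlalchemy import create_engine, MetaData, Table, Column, Integer

engine = create_engine('sqlite://')          # in-memory placeholder; yours will differ
metadata = MetaData()
mytable = Table('mytable', metadata,         # stand-in for the real table
                Column('x', Integer),
                Column('y', Integer),
                Column('z', Integer))
metadata.create_all(engine)

# Let the database do the x == y comparison instead of Python.
query = mytable.select().where(mytable.c.x == mytable.c.y)
for row in engine.execute(query):
    print(row.z)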

Ali A
I guess my oversimplified example is too oversimplified. :) The actual filter is more complex than that.
mike
Then let's see the real thing, please. In general, if you are getting data out of a database and then iterating over it in Python just to filter, you are probably doing something heinous.
Ali A
In actuality, the filter is a complex regular expression.
mike
Mike: please can I see it? As your question stands, this is the only sensible answer I can see.
Ali A
I will agree: filtering on the client is rarely the correct answer.
Godeke
A: 

You should post your profiler results, as well as stack traces around the _get_col call, so we know which _get_col is being called (and whether _get_col really is the bottleneck).
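
Something along these lines would narrow it down; a rough sketch using the standard library's cProfile, where process_rows() stands in for the loop from the question:

import cProfile
import pstats

def process_rows():
    # placeholder for the loop in the question
    for row in get_lots_of_rows():
        if row.x == row.y:
            print(row.z)

cProfile.run('process_rows()', 'rowprof')
stats = pstats.Stats('rowprof')
stats.sort_stats('cumulative').print_stats(20)   # which calls dominate the total time
stats.print_callers('_get_col')                  # and who is calling _get_col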

I looked at the SQLAlchemy source; it looks like it calls 'lookup_key' (in engine/base.py) on each access, and that appears to cache the column value locally, I guess lazily (via PopulateDict).

You can try bypassing that by using row.__props directly (not recommended, since it's private), or maybe you can use row.cursor, but it looks like you would gain the most by bypassing SQLAlchemy entirely (except for the SQL generation) and working directly with a cursor.

-- J

+1  A: 

SQLAlchemy proxies all access to the underlying database cursor so that it can map named keys to positions in the row tuple and perform any necessary type conversions. The underlying implementation is quite heavily optimized, caching almost everything. Looking over the disassembly, the only ways to optimize further seem to be to throw out extensibility and shave off a couple of attribute lookups, to resort to dynamic code generation for smaller gains, or, for a bigger win, to implement the corresponding ResultProxy and RowProxy classes in C.

Some quick profiling shows that the overhead is around 5µs per lookup on my laptop. That will be significant if only trivial processing is done with the data. In those kinds of cases it might be reasonable to drop down to the DBAPI level. This doesn't mean that you have to lose the query-building functionality of SQLAlchemy: just execute the statement as you usually would and get the DBAPI cursor from the ResultProxy by accessing result.cursor.cursor (result.cursor is an SQLAlchemy CursorFairy object). Then you can use the regular DBAPI fetchall(), fetchone() and fetchmany() methods.
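
A rough sketch of that approach, assuming the ResultProxy/CursorFairy layout described above (on other versions result.cursor may already be the raw DBAPI cursor) and using placeholder table and engine objects:

from sqlalchemy import create_engine, MetaData, Table, Column, Integer

engine = create_engine('sqlite://')              # placeholder engine
metadata = MetaData()
mytable = Table('mytable', metadata,             # stand-in for the real table
                Column('x', Integer), Column('y', Integer), Column('z', Integer))
metadata.create_all(engine)

# Build and execute the statement with SQLAlchemy as usual...
result = engine.execute(mytable.select().where(mytable.c.x == mytable.c.y))

# ...then fetch through the raw DBAPI cursor, skipping RowProxy entirely.
raw_cursor = result.cursor.cursor                # result.cursor is the CursorFairy wrapper
for row in raw_cursor.fetchall():                # plain tuples, positional access only
    print(row[2])                                # position of column z in this SELECT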

But if you really are doing only trivial processing, it might be better to do it, or at least the filtering part, on the database server. You probably lose database portability, but that might not be an issue.
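
Since the filter mentioned in the comments is a regular expression, here is a hedged sketch of pushing even that to the server. It assumes a PostgreSQL backend, whose ~ regex operator can be reached through SQLAlchemy's generic op() escape hatch; the DSN, table and pattern are placeholders:

from sqlalchemy import create_engine, MetaData, Table, Column, Text

engine = create_engine('postgresql://user:password@localhost/mydb')   # placeholder DSN
metadata = MetaData()
mytable = Table('mytable', metadata,                                   # stand-in for the real table
                Column('x', Text), Column('y', Text), Column('z', Text))

# PostgreSQL's ~ operator does the regex match server-side; op() emits it verbatim.
pattern = r'^foo.*bar$'                                                # stand-in for the real regex
query = mytable.select().where(mytable.c.x.op('~')(pattern))
for row in engine.execute(query):
    print(row.z)                                                       # rows arrive already filtered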

Ants Aasma