views: 340

answers: 2

I have the following code that runs over a large data set (2M rows). It eats up all of my 4 GB of memory before finishing.

    for sample in session.query(CodeSample).yield_per(100):
        for proj in projects:
            if sample.filename.startswith(proj.abs_source):
                sample.filename = "some other path"
                session.add(sample)

Then I ran it through a reduced data set and analyzed the heap with heapy. get_rp() gave me the following hint:

     0: _ --- [-] 47821 (0x9163aec | 0x9165fec | 0x916d6cc | 0x9251414 | 0x925704...
     1: a      [-] 8244 tuple: 0x903ec8c*37, 0x903fcfc*13, 0x9052ecc*46...
     2: aa ---- [S] 3446 types.CodeType: parseresult.py:73:src_path...
     3: ab      [S] 364 type: __builtin__.Struct, _random.Random, sqlite3.Cache...
     4: ac ---- [-] 90 sqlalchemy.sql.visitors.VisitableType: 0x9162f2c...
     5: aca      [S] 11 dict of module: ..sql..., codemodel, sqlalchemy
     6: acb ---- [-] 48 sqlalchemy.sql.visitors.VisitableType: 0x9162f2c...
     7: acba      [S] 9 dict of module: ..sql..., codemodel, sqlalchemy
     8: acbb ---- [-] 45 sqlalchemy.sql.visitors.VisitableType: 0x9165fec...
     9: acbba      [S] 8 dict of module: ..sql..., codemodel, sqlalchemy

I'm new to sqlalchemy. Is this a memory leak? Thanks.

+1  A: 

The session will keep track of all the CodeSample objects that you retrieve. So after iterating over 2M objects, the session keeps a reference to all of them. The session needs these references so it can write the correct changes to the database on flush. So I believe what you're seeing is to be expected.

To only keep N objects in memory at a time, you could do something like the code below (inspired by this answer, disclaimer: I have not tested it).

    offset = 0
    N = 10000
    got_rows = True
    while got_rows:
        got_rows = False
        for sample in session.query(CodeSample).limit(N).offset(offset):
            got_rows = True
            for proj in projects:
                if sample.filename.startswith(proj.abs_source):
                    sample.filename = "some other path"
        offset += N
        session.flush()       # writes changes to DB
        session.expunge_all() # removes objects from session

But the above is a bit clunky; perhaps some SQLAlchemy gurus know a better way to do this.

BTW, you should not need the session.add(); the session tracks changes to the objects. Why do you use yield_per? (EDIT: I guess this is to fetch the rows in chunks from the DB, is that correct? The session will keep track of all of them anyway.)
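
For illustration, here is a minimal, untested sketch of what I mean (it assumes your CodeSample model and session from the question):

    # Attribute changes on objects loaded through the session are tracked
    # automatically, so session.add() is only needed for brand-new objects.
    sample = session.query(CodeSample).first()
    sample.filename = "some other path"  # marks the instance as dirty
    session.flush()                      # emits the UPDATE; no session.add() required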

EDIT:

Hmm, looks like there is something I have misunderstood. From the docs:

weak_identity_map: When set to the default value of True, a weak-referencing map is used; instances which are not externally referenced will be garbage collected immediately. For dereferenced instances which have pending changes present, the attribute management system will create a temporary strong-reference to the object which lasts until the changes are flushed to the database, at which point it’s again dereferenced. Alternatively, when using the value False, the identity map uses a regular Python dictionary to store instances. The session will maintain all instances present until they are removed using expunge(), clear(), or purge().

and

prune(): Remove unreferenced instances cached in the identity map.

Note that this method is only meaningful if “weak_identity_map” is set to False. The default weak identity map is self-pruning.

Removes any object in this Session’s identity map that is not referenced in user code, modified, new or scheduled for deletion. Returns the number of objects pruned.
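
To make that concrete, here is a rough, untested sketch of how those options would be used with the 0.5-era Session API quoted above (the connection URL is made up, and see the comment below on why there is rarely a reason to do this):

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    engine = create_engine("sqlite:///codesamples.db")  # made-up URL
    # weak_identity_map=False makes the session hold strong references to
    # every loaded instance until they are expunged or pruned.
    Session = sessionmaker(bind=engine, weak_identity_map=False)
    session = Session()

    # ... query and modify objects ...

    session.flush()
    pruned = session.prune()  # drops unreferenced, unmodified instances from the identity map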

codeape
Thanks for the answer. yield_per limits the number of results SQLAlchemy loads into memory at once. By default, SQLAlchemy will load all results into memory.
Purui
the "non-weak" identity map and the "prune()" option shouldn't really be used - there's hardly a reason to and their existence is very legacy at this point. We'll probably get rid of them at some point. The Session normally doesn't hold onto any objects that have no pending changes and aren't referenced elsewhere. It took us a long time to get that to work but it works fully now.
zzzeek
+1  A: 

Most DBAPIs, including psycopg2 and mysql-python, fully load all results into memory before releasing them to the client. SQLA's yield_per() option doesn't work around this, with the exceptions below, which is why it's generally not a very useful option (edit: "useful" here meaning able to begin streaming results before the actual rows are fully fetched).

The exceptions to this behavior are:

  1. Using a DBAPI that doesn't buffer rows. cx_oracle is one, as a result of the natural way OCI works. Not sure about pg8000's behavior, and there's also a new MySQL DBAPI called OurSQL which I am told by its creator does not buffer rows. pg8000 and OurSQL are supported by SQLAlchemy 0.6.
  2. With psycopg2, a "server side cursor" may be used. SQLAlchemy supports a create_engine() flag "server_side_cursors=True" which uses server side cursors for all row-selecting operations. However, because server side cursors are generally expensive and thus will reduce performance for smaller queries, SQLAlchemy 0.6 now supports psycopg2's server side cursor on a per-statement or per-query basis using .execution_options(stream_results=True), where execution_options is available on Query, select(), text(), and Connection. The Query object calls this option when yield_per() is used, so in 0.6 yield_per() in conjunction with psycopg2 is actually useful (see the sketch after this list).
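
For illustration, a rough, untested sketch of the two approaches from point 2 (the PostgreSQL URL is made up, and CodeSample is the mapped class from the question):

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    # Assumed psycopg2/PostgreSQL URL. server_side_cursors=True makes the
    # engine use a server side cursor for all row-selecting statements.
    engine = create_engine("postgresql://user:pass@localhost/mydb",
                           server_side_cursors=True)
    session = sessionmaker(bind=engine)()

    # Per-query alternative in 0.6: yield_per() sets stream_results for
    # psycopg2, so rows are streamed and only ~100 ORM objects are
    # materialized at a time.
    for sample in session.query(CodeSample).yield_per(100):
        pass  # process each sample here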
zzzeek
SQLite WILL return results one at a time; this is supported by the Python bindings (http://docs.python.org/library/sqlite3.html). But the ORM has to be using that functionality.
Joe Koberg
I don't understand here. In my test (0.5.8), yield_per did reduce memory consumption. If I use for c in session.query(CodeSample).all():, the program allocated memory like crazy. But when I used yield_per, it finished successfully.
Purui
Well, yield_per() will prevent it from creating N number of *ORM* objects in memory at a time - so in that sense it has some utility regarding memory. But in cases where the DBAPI doesn't support streaming, it won't be able to yield ORM objects until the underlying result set is fully fetched.
zzzeek