views: 340

answers: 2

I have the following code that runs over a large data set (2M rows). It eats up all of my 4 GB of memory before finishing.

    for sample in session.query(CodeSample).yield_per(100):
        for proj in projects:
            if sample.filename.startswith(proj.abs_source):
                sample.filename = "some other path"
                session.add(sample)

Then I ran it through a reduced data set and analyzed the heap with heapy. get_rp() gave me the following hint:

     0: _ --- [-] 47821 (0x9163aec | 0x9165fec | 0x916d6cc | 0x9251414 | 0x925704...
     1: a      [-] 8244 tuple: 0x903ec8c*37, 0x903fcfc*13, 0x9052ecc*46...
     2: aa ---- [S] 3446 types.CodeType: parseresult.py:73:src_path...
     3: ab      [S] 364 type: __builtin__.Struct, _random.Random, sqlite3.Cache...
     4: ac ---- [-] 90 sqlalchemy.sql.visitors.VisitableType: 0x9162f2c...
     5: aca      [S] 11 dict of module: ..sql..., codemodel, sqlalchemy
     6: acb ---- [-] 48 sqlalchemy.sql.visitors.VisitableType: 0x9162f2c...
     7: acba      [S] 9 dict of module: ..sql..., codemodel, sqlalchemy
     8: acbb ---- [-] 45 sqlalchemy.sql.visitors.VisitableType: 0x9165fec...
     9: acbba      [S] 8 dict of module: ..sql..., codemodel, sqlalchemy

I'm new to sqlalchemy. Is this a memory leak? Thanks.

+1  A: 

The session will keep track of all the CodeSample objects that you retrieve. So after iterating over 2M objects, the session keeps a reference to all of them. The session needs these references so it can write the correct changes to the database on flush. So I believe what you're seeing is to be expected.

To only keep N objects in memory at a time, you could do something like the code below (inspired by this answer, disclaimer: I have not tested it).

    offset = 0
    N = 10000
    got_rows = True
    while got_rows:
        got_rows = False
        for sample in session.query(CodeSample).limit(N).offset(offset):
            got_rows = True
            for proj in projects:
                if sample.filename.startswith(proj.abs_source):
                    sample.filename = "some other path"
        offset += N
        session.flush()       # writes changes to DB
        session.expunge_all() # removes objects from session

But the above is a bit clunky; perhaps some SQLAlchemy gurus know a better way to do this.

BTW, you should not need the session.add(); the session tracks changes to the objects. Why do you use yield_per? (EDIT: I guess this is to fetch the rows in chunks from the DB, is that correct? The session will keep track of all of them anyway.)
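
For illustration, here is a minimal, untested sketch of what I mean (it assumes your CodeSample model and session from the question):

    # Attribute changes on objects loaded through the session are tracked
    # automatically, so session.add() is only needed for brand-new objects.
    sample = session.query(CodeSample).first()
    sample.filename = "some other path"  # marks the instance as dirty
    session.flush()                      # emits the UPDATE; no session.add() required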

EDIT:

Hmm, looks like there is something I have misunderstood. From the docs:

weak_identity_map: When set to the default value of True, a weak-referencing map is used; instances which are not externally referenced will be garbage collected immediately. For dereferenced instances which have pending changes present, the attribute management system will create a temporary strong-reference to the object which lasts until the changes are flushed to the database, at which point it’s again dereferenced. Alternatively, when using the value False, the identity map uses a regular Python dictionary to store instances. The session will maintain all instances present until they are removed using expunge(), clear(), or purge().

and

prune(): Remove unreferenced instances cached in the identity map.

Note that this method is only meaningful if “weak_identity_map” is set to False. The default weak identity map is self-pruning.

Removes any object in this Session’s identity map that is not referenced in user code, modified, new or scheduled for deletion. Returns the number of objects pruned.
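
To make that concrete, here is a rough, untested sketch of how those options would be used with the 0.5-era Session API quoted above (the connection URL is made up, and see the comment below on why there is rarely a reason to do this):

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    engine = create_engine("sqlite:///codesamples.db")  # made-up URL
    # weak_identity_map=False makes the session hold strong references to
    # every loaded instance until they are expunged or pruned.
    Session = sessionmaker(bind=engine, weak_identity_map=False)
    session = Session()

    # ... query and modify objects ...

    session.flush()
    pruned = session.prune()  # drops unreferenced, unmodified instances from the identity map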

codeape
Thanks for the answer. yield_per limits the number of results SQLAlchemy loads into memory at once. By default, SQLAlchemy will load all results into memory.
Purui
the "non-weak" identity map and the "prune()" option shouldn't really be used - there's hardly a reason to and their existence is very legacy at this point. We'll probably get rid of them at some point. The Session normally doesn't hold onto any objects that have no pending changes and aren't referenced elsewhere. It took us a long time to get that to work but it works fully now.
zzzeek
+1  A: 

Most DBAPIs, including psycopg2 and mysql-python, fully load all results into memory before releasing them to the client. SQLA's yield_per() option doesn't work around this, with the exceptions below, which is why it's generally not a very useful option (edit: "useful" here meaning able to begin streaming results before the actual rows are fully fetched).

The exceptions to this behavior are:

  1. Using a DBAPI that doesn't buffer rows. cx_oracle is one, as a result of the natural way OCI works. Not sure about pg8000's behavior, and there's also a new MySQL DBAPI called OurSQL which I am told by its creator does not buffer rows. pg8000 and OurSQL are supported by SQLAlchemy 0.6.
  2. With psycopg2, a "server side cursor" may be used. SQLAlchemy supports a create_engine() flag "server_side_cursors=True" which uses server side cursors for all row-selecting operations. However, because server side cursors are generally expensive and thus will reduce performance for smaller queries, SQLAlchemy 0.6 now supports psycopg2's server side cursor on a per-statement or per-query basis using .execution_options(stream_results=True), where execution_options is available on Query, select(), text(), and Connection. The Query object calls this option when yield_per() is used, so in 0.6 yield_per() in conjunction with psycopg2 is actually useful (see the sketch after this list).
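
For illustration, a rough, untested sketch of the two approaches from point 2 (the PostgreSQL URL is made up, and CodeSample is the mapped class from the question):

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    # Assumed psycopg2/PostgreSQL URL. server_side_cursors=True makes the
    # engine use a server side cursor for all row-selecting statements.
    engine = create_engine("postgresql://user:pass@localhost/mydb",
                           server_side_cursors=True)
    session = sessionmaker(bind=engine)()

    # Per-query alternative in 0.6: yield_per() sets stream_results for
    # psycopg2, so rows are streamed and only ~100 ORM objects are
    # materialized at a time.
    for sample in session.query(CodeSample).yield_per(100):
        pass  # process each sample here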
zzzeek
SQLite WILL return results one at a time; this is supported by the Python bindings (http://docs.python.org/library/sqlite3.html). But the ORM has to be using that functionality.
Joe Koberg
I don't understand here. In my test (0.5.8), yield_per did reduce memory consumption. If I use for c in session.query(CodeSample).all():, the program allocated memory like crazy. But when I used yield_per, it finished successfully.
Purui
Well, yield_per() will prevent it from creating N number of *ORM* objects in memory at a time - so in that sense it has some utility regarding memory. But in cases where the DBAPI doesn't support streaming, it won't be able to yield ORM objects until the underlying result set is fully fetched.
zzzeek