I have a Django application that exhibits some strange garbage collection behavior. There is one view in particular that will just keep growing the VM size significantly every time it is called - up to a certain limit, at which point usage drops back again. The problem is that it's taking considerable time until that point is reached, and in fact the virtual machine running my app doesn't have enough memory for all FCGI processes to take as much memory as they then sometimes do.
I've spent the last two days investigating this and learning about Python garbage collection, and I think I do understand what is happening now - for the most part. When using
gc.set_debug(gc.DEBUG_STATS)
Then for a single request, I see the following output:
>>> c = django.test.Client()
>>> c.get('/the/view/')
gc: collecting generation 0...
gc: objects in each generation: 724 5748 147341
gc: done.
gc: collecting generation 0...
gc: objects in each generation: 731 6460 147341
gc: done.
[...more of the same...]
gc: collecting generation 1...
gc: objects in each generation: 718 8577 147341
gc: done.
gc: collecting generation 0...
gc: objects in each generation: 714 0 156614
gc: done.
[...more of the same...]
gc: collecting generation 0...
gc: objects in each generation: 715 5578 156612
gc: done.
So essentially, a huge amount of objects are allocated, but are initially moved to generation 1, and when gen 1 is sweeped during the same request, they are moved to generation 2. If I do a manual gc.collect(2) afterwards, they are removed. And, as I mentioned, there also removed when the next automatic gen 2 sweep happens, which, if I understand correctly, would in this case something like every 10 requests (at this point the app needs about a 150MB).
Alright, so initially I thought that there might be some cyclic referencing going on within the processing of one request that prevents any of these objects from being collected within the handling of that request. However, I've spent hours trying to find one using pympler.muppy and objgraph, both after and by debugging inside the request processing, and there don't seem to be any. Rather, it seems the 14.000 objects or so that are created during the request are all within a reference chain to some request-global object, i.e. once the request goes away, they can be freed.
That has been my attempt at explaining it, anyway. However, if that's true and there are indeed no cycling dependencies, shouldn't the whole tree of objects be freed once whatever request object that causes them to be held goes away, without the garbage collector being involved, purely by virtue of the reference counts dropping to zero?
With that setup, here are my questions:
Does the above even make sense, or do I have to look for the problem elsewhere? Is it just an unfortunate accident that significant data is kept around for so long in this particular use case?
Is there anything I can do to avoid the issue. I already see some potential to optimize the view, but that appears to be a solution with limited scope - although I am not sure what I generic one would be, either; how advisable is it for example to call gc.collect() or gc.set_threshold() manually?
In terms of how the garbage collector itself works:
Do I understand correctly that an object is always moved to the next generation if a sweep looks at it and determines that the references it has are not cyclic, but can in fact be traced to a root object.
What happens if the gc does a, say, generation 1 sweep, and finds an object that is referenced by an object within generation 2; does it follow that relationship inside generation 2, or does it wait for a generation 2 sweep to occur before analyzing the situation?
When using gc.DEBUG_STATS, I care primarily about the "objects in each generation" info; however, I keep getting hundreds of "gc: 0.0740s elapsed.", "gc: 1258233035.9370s elapsed." messages; they are totally inconvenient - it takes considerable time for them to be printed out, and they make the interesting things a lot harder to find. Is there a way to get rid of them?
I don't suppose there is a way to do a gc.get_objects() by generation, i.e. only retrieve the objects from generation 2, for example?