views: 516

answers: 4

GAE has various limitations, one of which is that the largest allocatable block of memory is 1 MB (now ten times more, but that doesn't change the question). This limitation means that one cannot put more than some number of items in a list, since CPython tries to allocate a contiguous memory block for the element pointers. Having huge lists can be considered bad programming practice, but even if no huge structure is created in the program itself, CPython maintains some behind the scenes.
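
To put a rough number on that limit, here is a back-of-the-envelope sketch (it assumes a 32-bit build, where each list element is a 4-byte pointer):

block_size = 1 * 1024 * 1024      # 1 MB, the original GAE limit
max_elements = block_size // 4    # one 4-byte pointer per element: ~262144 entries
print(max_elements)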

It appears that CPython maintains a single global list of objects, or something similar: an application with many small objects tends to allocate bigger and bigger single blocks of memory.

My first suspect was gc, and disabling it changes the application's behavior a bit, but some structures are still maintained.
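
For reference, by disabling it I mean simply:

import gc
gc.disable()   # turn off the cyclic collector; reference counting still frees objects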

The simplest short application that exhibits the issue is:

a = b = []                        # 'a' keeps a reference to the outermost list
number_of_lists = 8000000
for i in xrange(number_of_lists):
    b.append([])                  # nest a new empty list inside the current one
    b = b[0]                      # descend into the list just appended

Can anyone enlighten me as to how to prevent CPython from allocating huge internal structures when an application has many objects?

A: 

I'm a bit confused as to what you're asking. In that code example, nothing should be garbage collected, as you're never actually killing off any references. You're holding a reference to the top level list in a and you're adding nested lists (held in b at each iteration) inside of that. If you remove the 'a =', then you've got unreferenced objects.
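
For instance, a minimal sketch of that last point: without the 'a =', each outer list loses its last reference as soon as the loop descends past it, so the chain is reclaimed link by link:

b = []                   # no 'a =': nothing retains the head of the chain
for i in xrange(1000):
    b.append([])
    b = b[0]             # the abandoned outer list's refcount hits zero and it
                         # is freed; the inner list survives only because b names it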

Edit: In response to the first part, yes, Python holds a list of objects so it can know what to cull. Is that the whole question? If not, comment/edit your question and I'll do my best to help fill in the gaps.

Cody Brocious
I've put some background in question for clarification. Does that help?
myroslav
A: 

What are you trying to accomplish with the

a = b = []

and

b = b[0]

statements? It's certainly odd to see statements like that in Python, because they don't do what you might naively expect: in that example, a and b are two names for the same list (think pointers in C). If you're doing a lot of manipulation like that, it's easy to confuse the garbage collector (and yourself!) because you've got a lot of strange references floating around that haven't been properly cleared.
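
To see the aliasing concretely (a small illustrative snippet):

a = b = []
print(a is b)    # True: one list object, two names
b.append(1)
print(a)         # [1] - mutating through b is visible through a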

It's hard to diagnose what's wrong with that code without knowing why you want to do what it appears to be doing. Sure, it exposes a bit of interpreter weirdness... but I'm guessing you're approaching your problem in an odd way, and a more Pythonic approach might yield better results.

kquinn
I think it was simply an example given to ask about some underlying GC structure, but I couldn't make it out either.
Cody Brocious
Yeah, it's just a *weird* example. a = b = [] is easy enough to untangle (two references to one list which starts empty), but b = b[0] is giving my mind fits trying to think about the internals of CPython. Perhaps I should just go to sleep already....
kquinn
The example program is just the simplest way to spawn a huge number of linked objects that cannot be garbage collected. After execution, a points to the outermost list; each list contains a single element that is another list containing another single element, and so on.
myroslav
It really just causes it to nest lists (inside of the original, held in 'a'). Took me a second to grok too.
Cody Brocious
The code is not wrong, except that GAE kills the application because some internal CPython structure exceeds the largest allocatable memory block. I'm trying to figure out how to work around that.
myroslav
+7  A: 

On a 32-bit system, each of the 8000000 lists you create will allocate 20 bytes for the list object itself, plus 16 bytes for a vector of list elements. So you are trying to allocate at least (20 + 16) * 8000000 = 288000000 bytes, about 288 MB. And that's in the best case, if the system malloc allocates only exactly as much memory as requested.

I calculated the size of the list object as follows:

  • two pointers in the PyListObject structure itself (see listobject.h)
  • one pointer and one Py_ssize_t for the PyObject_HEAD part of the list object (see object.h)
  • one Py_ssize_t for the PyObject_VAR_HEAD (also in object.h)

The vector of list elements is slightly overallocated to avoid having to resize it at each append - see list_resize in listobject.c. The sizes are 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ... Thus, your one-element lists will allocate room for 4 elements.
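
A quick way to watch the overallocation happen is sys.getsizeof (a sketch assuming Python 2.6+, where it is available; exact byte counts vary by platform and version):

import sys

l = []
sizes = []
for i in range(10):
    sizes.append(sys.getsizeof(l))   # includes the overallocated element vector
    l.append(None)
print(sizes)   # size jumps in steps matching the 0, 4, 8, 16, ... slot pattern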

Your data structure is a somewhat pathological example, paying the price of a variable-sized list object without utilizing it: all your lists have only a single element. You could avoid the 12 bytes of overallocation by using tuples instead of lists, but to reduce the memory consumption further, you would have to use a different data structure with fewer objects. It's hard to be more specific, as I don't know what you are trying to accomplish.
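
A minimal sketch of the tuple alternative (built inside-out, since tuples are immutable, and a tuple allocates exactly one slot per element with no spare capacity):

a = ()
for i in xrange(8000000):
    a = (a,)    # wrap the existing chain in a new one-element tuple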

oefe
A: 

In case you're not aware of it, Python has its own small-object allocator, pymalloc. You can disable it by passing --without-pymalloc to the configure step.

However, pymalloc's arenas are only 256 KB each, so it shouldn't be the source of the problem. You can also compile Python with debugging enabled, using --with-pydebug, which would give you more information about memory use.

I suspect your hunch is right, and I'm sure oefe's diagnosis is correct. A list uses contiguous memory, so if your list gets too large for a system arena, you're out of luck. If you're really adventurous you could reimplement PyList to use multiple blocks, but that would be a lot of work, since various bits of Python expect contiguous data.

Andrew Dalke