ansaurus

Question

Why do dicts of defaultdict(int)'s use so much memory? (and other simple python performance questions)

Answer 1

+3 A:

Python ints are internally represented as C longs (it's actually a bit more complicated than that), but that's not really the root of your problem.

The biggest overhead is your usage of dicts. (defaultdicts and dicts are about the same in this description). dicts are implemented using hash tables, which is nice because it gives quick lookup of pretty general keys. (It's not so necessary when you only need to look up sequential numerical keys, since they can be laid out in an easy way to get to them.)

A dict can have many more slots than it has items. Let's say you have a dict with 3x as many slots as items. Each of these slots needs room for a pointer to a key and a pointer serving as the end of a linked list. That's 6x as many pointers as numbers, plus all the pointers to the items you're interested in. Consider that each of these pointers is 8 bytes on your system and that you have 16384 defaultdicts in this situation. As a rough, handwavey look at this, 16384 occurrences * (8192 items/occurrence) * 7 (pointers/item) * 8 (bytes/pointer) = 7 GB. This is before I've gotten to the actual numbers you're storing (each unique number of which is itself a Python object), the outer dict, that numpy array, or the stuff Python's keeping track of to try to optimize some.
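The rough estimate above can be reproduced in a few lines. This is only a sketch of the same back-of-the-envelope arithmetic: the variable names are made up for illustration, and the 7 pointers per item is the handwavey figure from the paragraph, not a measured value.

```python
# Back-of-the-envelope estimate; every figure here is an assumption from the text.
num_defaultdicts = 16384      # one defaultdict per top-level key
items_per_dict = 8192         # entries stored in each defaultdict
pointers_per_item = 7         # rough guess: 3x slots * 2 pointers, plus 1 for the value
bytes_per_pointer = 8         # 64-bit build

estimated_bytes = (num_defaultdicts * items_per_dict
                   * pointers_per_item * bytes_per_pointer)
estimated_gib = estimated_bytes / 2**30
print(f"{estimated_gib:.1f} GiB")   # prints "7.0 GiB"
```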

Your overhead sounds a little higher than I would expect, and I would be interested in knowing whether that 11 GB was for the whole process or whether you calculated it for just `table`. In any event, I do expect the size of this dict-of-defaultdicts data structure to be orders of magnitude bigger than the numpy array representation.

As to "is there any way to avoid these costs?" the answer is "use numpy for storing large, fixed-size contiguous numerical arrays, not dicts!" You'll have to be more specific and concrete about why you found such a structure necessary for better advice about what the best solution is.

Mike Graham 2010-04-30 21:36:10

The 11 GB is for the whole process.

2010-04-30 21:48:39

So are you arguing that a Python list of those values shouldn't take up as much space, since you are making dict-specific remarks? Because my second code snippet also uses a huge quantity of RAM.

2010-04-30 21:49:57

@dukhat, `list` is implemented very differently from `dict`. A list reserves at most about twice as much memory as its contents need and doesn't need to store pointers for things like keys. Keeping a list will still be much, much more memory-hungry than using a numpy array. This is one of the major reasons we use numpy arrays.
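A small sketch of that difference, using the standard library's `array` module as a stand-in for a numpy int32 array (an assumption made here only so the example has no third-party dependency). The sizes come from `sys.getsizeof`, which is a CPython-specific measurement, and the exact numbers will vary by build.

```python
import sys
from array import array

n = 100_000
# Values outside CPython's small-int cache, so each one is its own object.
values = list(range(1000, 1000 + n))

# A list stores one pointer per element, and each element is a separate int object.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# A typed array stores the raw C ints contiguously (typically 4 bytes each).
packed = array("i", values)
packed_bytes = sys.getsizeof(packed)

print(list_bytes, packed_bytes)
```

On a typical 64-bit CPython the list version should come out several times larger than the packed one.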

Mike Graham 2010-04-30 21:52:17

Answer 2

+1 A:

Well, look at what your code is actually doing:

from collections import defaultdict

topKeys = range(16384)
table = dict((k,defaultdict(int)) for k in topKeys)

This creates a dict holding 16384 defaultdict(int)'s. A dict has a certain amount of overhead: the dict object itself is between 60 and 120 bytes (depending on the size of pointers and ssize_t's in your build). That's just the object itself; unless the dict holds only a handful of items, the entries live in a separate block of memory, at between 12 and 24 bytes per entry, and that block is always between 1/2 and 2/3rds filled. defaultdicts are 4 to 8 bytes bigger because they have one extra pointer to store (the default factory). ints are 12 bytes each, and although they're reused where possible, that snippet won't reuse most of them. So, realistically, in a 32-bit build, that snippet will take up 60 + (16384*12) * 1.8 (fill factor) bytes for the table dict, 16384 * 64 bytes for the defaultdicts it stores as values, and 16384 * 12 bytes for the integers. That's just over a megabyte and a half without storing anything in your defaultdicts. And that's in a 32-bit build; a 64-bit build would be twice that size.
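For reference, the 32-bit estimate can be written out as a quick calculation. This is just the arithmetic from the paragraph with made-up variable names; the byte counts are the figures quoted in the text, not measurements.

```python
# Reproduces the 32-bit estimate from the text; all byte counts are the text's figures.
n = 16384
dict_object = 60                 # the outer dict object itself
table_entries = n * 12 * 1.8     # 12 bytes per entry, times the slack in the hash table
defaultdicts = n * 64            # one empty defaultdict per key
int_keys = n * 12                # one int object per key

total = dict_object + table_entries + defaultdicts + int_keys
print(f"{total / 2**20:.2f} MiB")
```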

Then you create a numpy array, which is actually pretty conservative with memory:

import numpy as num
dat = num.zeros((16384,8192), dtype="int32")

This will have some overhead for the array itself, the usual Python object overhead plus the dimensions and type of the array and such, but it wouldn't be much more than 100 bytes, and only for the one array. The data itself is 16384*8192 int32's at 4 bytes each, which is your 512 MB.

And then you have this rather peculiar way of filling this numpy array:

for k in topKeys:
    for j in keys:
        dat[k,j] = table[k][j]

The two loops themselves don't use much memory, and they re-use it each iteration. However, because table[k] is a defaultdict, table[k][j] creates the default value for every missing key you request and stores it in the defaultdict. The value is always the integer 0, and it so happens that that object always gets reused, but storing the reference to it still uses up space in the defaultdict: the aforementioned 12 bytes per entry, times the fill factor (between 1.66 and 2). That lands you close to 3 GB of actual data right there, and 6 GB in a 64-bit build.
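A minimal illustration of that behaviour, with an arbitrary name (`counts`) chosen for the example: merely reading a missing key from a defaultdict is enough to store it.

```python
from collections import defaultdict

counts = defaultdict(int)
print(len(counts))        # 0: nothing stored yet

value = counts[42]        # a plain read of a missing key...
print(value)              # 0: the default
print(len(counts))        # 1: ...but the key is now stored

for j in range(1000):     # reading 1000 keys leaves 1000 entries behind
    counts[j]
print(len(counts))        # 1000 (42 was already among them)
```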

On top of that, the defaultdicts, because you keep adding data, have to keep growing, which means they have to keep reallocating. Because of Python's malloc frontend (obmalloc) and how it allocates smaller objects in blocks of its own, and how process memory works on most operating systems, this means your process will allocate more and not be able to free it; it won't actually use all of the 11 GB, and Python will re-use the available memory in between the large blocks for the defaultdicts, but the total mapped address space will be that 11 GB.
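The repeated growth can be observed on a small scale. The sketch below assumes CPython, where `sys.getsizeof` reports the dict's own table size; it only shows the resize steps, not the allocator fragmentation described above.

```python
import sys
from collections import defaultdict

d = defaultdict(int)
sizes = []
for j in range(10_000):
    d[j]                          # read-only access still inserts the key
    sizes.append(sys.getsizeof(d))

# The reported size changes only at resize points, not on every insert.
resize_points = [i for i in range(1, len(sizes)) if sizes[i] != sizes[i - 1]]
print(len(resize_points), sizes[0], sizes[-1])
```

Each resize point is a moment where the table was reallocated into a larger block and the old block was handed back to the allocator.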

Thomas Wouters 2010-04-30 21:43:14

Answer 3

+1 A:

Mike Graham gives a good explanation of why dictionaries use more memory, but I thought that I'd explain why your table dict of defaultdicts starts to take up so much memory.

The way that the defaultdict (DD) is set up right now, whenever you retrieve an element that isn't in the DD, you get the default value for the DD (0 in your case), but the DD also now stores the key that previously wasn't in it, with the default value of 0. I personally don't like this, but that's how it goes. However, it means that for every iteration of the inner loop, new memory is being allocated, which is why it is taking forever. If you change the lines

for k in topKeys:
    for j in keys:
        dat[k,j] = table[k][j]

to

for k in topKeys:
    for j in keys:
        if j in table[k]:
            dat[k,j] = table[k][j]
        else:
            dat[k,j] = 0

then default values aren't being assigned to keys in the DDs, and so the memory stays around 540 MB for me, which is mostly just the memory allocated for dat. DDs are decent for sparse matrices, though you probably should just use the sparse matrices in Scipy if that's what you want.

Justin Peel 2010-04-30 21:45:59

If you change the loop that way, it'll also take a lot less time -- it won't *do* anything, because all the `defaultdict`s are empty :) Creating the object automatically and inserting it in the dict is the purpose of a `defaultdict`, the reason it was added. If you don't want it to be inserted, you really shouldn't be using a defaultdict (but, for example, a custom class with a __getitem__ that does exactly what you want.)
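One possible shape for such a class, as a sketch only: instead of overriding `__getitem__` directly, this uses the `__missing__` hook that `dict.__getitem__` calls for absent keys. The name `ZeroDict` is hypothetical.

```python
class ZeroDict(dict):
    """Hypothetical dict that returns 0 for missing keys without storing them."""

    def __missing__(self, key):
        # Called by dict.__getitem__ when the key is absent; return the
        # default but deliberately do not insert it.
        return 0


row = ZeroDict()
row[3] += 5                 # reads 0 via __missing__, then stores 5 explicitly
print(row[3], row[99])      # 5 0
print(len(row))             # 1: reading row[99] did not add a key
```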

Thomas Wouters 2010-04-30 21:48:58

Exactly, but it wasn't clear to me that the OP understood what was going on in the DD. Personally, I think that the DD shouldn't add the key if the value is merely being requested for a key that isn't in the DD. The key should only be added when it is being set, but that's just my opinion.

Justin Peel 2010-04-30 21:56:42

@Justin, Consider the common case of `foo = defaultdict(list)`, `foo[bar].append(baz)`.
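Spelled out as a small sketch (the word list and names are made up for illustration), this is the pattern where storing on first access is exactly what you want:

```python
from collections import defaultdict

words = ["apple", "avocado", "banana", "blueberry", "cherry"]
by_letter = defaultdict(list)
for word in words:
    # by_letter[...] must store the new list on first access; otherwise
    # append() would modify a temporary list that is immediately discarded.
    by_letter[word[0]].append(word)

print(dict(by_letter))
```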

Mike Graham 2010-04-30 22:16:36

@Mike That's a good point. I guess it has to be that way to be able to deal with defaultdicts where the values are arrays, but I still think that it sets the stage for a big memory mess.

Justin Peel 2010-04-30 23:20:42

@Justin, `dict` has a `get` method for what you want `defaultdict` to do. What `defaultdict` does is simpler than what you want it to do, more consistent, and doesn't lead to a memory mess when you use `defaultdict` in the ways I've always used it and the ways I've generally seen it used in the wild. It isn't really intended for what OP was using it for, so I really don't see much harm in its design.
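A short sketch of the `get` approach; it works the same on a defaultdict, since `get` never triggers the default factory. The name `row` is arbitrary.

```python
from collections import defaultdict

row = defaultdict(int)
row[1] = 7

print(row.get(1, 0))    # 7
print(row.get(2, 0))    # 0, and key 2 is not inserted
print(len(row))         # 1
```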

Mike Graham 2010-04-30 23:48:48

@Mike Yeah, I've just seen several questions on here because people were using defaultdict in this manner so it gives me that impression. I did know about dict's get method.

Justin Peel 2010-04-30 23:58:23
