views: 185
answers: 4

Hi,

Say there is a dict variable that grows very large during runtime, up to millions of key:value pairs.

Does this variable get stored in RAM, effectively using up all the available memory and slowing down the rest of the system?

Asking the interpreter to display the entire dict is a bad idea, but would it be okay as long as one key is accessed at a time?

-Tim

+3  A: 

Yes, a Python dict is stored in RAM. A few million keys isn't an issue for modern computers, however. If you need more and more data and RAM is running out, consider using a real database. Options include a relational DB like SQLite (built into Python, by the way) or a key-value store like Redis.

It makes little sense to display millions of items in the interpreter, but accessing a single element should still be very efficient.
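As a rough illustration of that last point (a sketch only, not from the original answer; it assumes Python 2.6+, where timeit accepts a callable, and the key and repeat count are arbitrary choices):

import timeit

# Build a dict with a million entries; the whole thing lives in RAM.
d = dict((i, i) for i in xrange(10**6))

# Time a million lookups of one key. A lookup is a single hash-table probe,
# so the cost per access does not grow with the size of the dict.
print(timeit.timeit(lambda: d[500000], number=10**6))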

Eli Bendersky
What about bsddb?
tstenner
+1  A: 

As far as I know, Python uses a well-tuned hash table implementation for dicts, so you are probably going to get good memory efficiency and performance. Now, whether the whole thing is kept in RAM or committed to a swap file is up to your OS and depends on the amount of RAM you have. What I'd say is best is to just try it:

from random import randint
a = {}
for i in xrange(10*10**6):
    a[i] = i

How does this look when you run it? It takes about 350 MB on my system, which should be manageable, to say the least.

kaloyan
The computer I'm stuck with right now has 512 MB of RAM, which is why I'm concerned. However, the most keys I'll have is about 10k, so I don't think it should be a problem. Thanks for the test, though; I wouldn't try that on this one.
PPTim
Most OSes are pretty smart about managing memory and swap. You really should be fine with a dictionary of any size as long as you have the hard drive space for the swap file.
kaloyan
I'm guessing there's a random factor missing from that snippet - possibly the dictionary key?
Kylotan
Well, each key is equivalent to the value stored, but it still works.
PPTim
@Kylotan, @PPTim: Fair points. dict((i, -i) for i in xrange(10**7)) is about 440 MB. If you make the values not only unique but floats, you get up to 480 MB: dict((i, float(i) / 10**7) for i in xrange(10**7)). Both are still nothing virtual memory can't handle.
kaloyan
+5  A: 

The main concern with the millions of items is not the dictionary itself so much as how much space each of these items takes up. Still, unless you're doing something weird, they should probably fit.

If you've got a dict with millions of keys, though, you're probably doing something wrong. You should do one or both of:

  1. Figure out what data structure you should actually be using, because a single dict is probably not the right answer. Exactly what this would be depends on what you're doing.

  2. Use a database. Your Python should come with a sqlite3 module, so that's a start; a rough sketch of using it as a key-value store follows below.
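As a sketch of option 2 (not part of the original answer; the file name, table layout, and value type are just placeholder assumptions), the built-in sqlite3 module can act as a simple on-disk key-value store:

import sqlite3

# Open (or create) a database file on disk instead of holding everything in RAM.
conn = sqlite3.connect('kvstore.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value REAL)')

def put(key, value):
    # INSERT OR REPLACE overwrites an existing key, mimicking d[key] = value
    conn.execute('INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)', (key, value))

def get(key):
    row = conn.execute('SELECT value FROM kv WHERE key = ?', (key,)).fetchone()
    return row[0] if row is not None else None

put('temperature', 21.5)
print(get('temperature'))   # 21.5
conn.commit()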

kwatford
+2  A: 

Yes, the dict will be stored in the process memory. So if it gets large enough that there's not enough room in the system RAM, then you can expect to see massive slowdown as the system starts swapping memory to and from disk.

Others have said that a few million items shouldn't pose a problem; I'm not so sure. The dict overhead itself (before counting the memory taken by the keys and values) is significant. For Python 2.6 or later, sys.getsizeof gives some useful information about how much RAM various Python structures take up. Some quick results, from Python 2.6 on a 64-bit OS X machine:

>>> from sys import getsizeof
>>> getsizeof(dict((n, 0) for n in range(5462)))/5462.
144.03368729403149
>>> getsizeof(dict((n, 0) for n in range(5461)))/5461.
36.053470060428495

So the dict overhead varies between 36 bytes per item and 144 bytes per item on this machine (the exact value depending on how full the dictionary's internal hash table is; here 5461 = 2**14//3 is one of the thresholds where the internal hash table is enlarged). And that's before adding the overhead for the dict items themselves; if they're all short strings (6 characters or less, say) then that still adds another >= 80 bytes per item (possibly less if many different keys share the same value).
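A rough way to fold the keys and values into the estimate, in the same interpreter style (a sketch only, not from the original answer; getsizeof reports each object itself and nothing it references, so values that are containers are still undercounted):

>>> d = dict((str(n), float(n)) for n in range(10**5))
>>> overhead = getsizeof(d)
>>> contents = sum(getsizeof(k) + getsizeof(v) for k, v in d.iteritems())
>>> (overhead + contents) / float(len(d))   # approximate bytes per item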

So it wouldn't take that many million dict items to exhaust RAM on a typical machine.

Mark Dickinson
Thanks, I learned about getsizeof from this. Practically, I will only be dealing with ~15k values at most, and speed is of the essence. I'm using a dict simply because I haven't touched databases at all, but I'm assuming a DB that reads and writes from a hard drive would be slower than reading/writing to a dict?
PPTim
Okay, for a dict that size you shouldn't have any trouble. What are the types of the keys and values? Strings?
Mark Dickinson
mostly floats, some strings, a few lists
PPTim