views:

149

answers:

3

Hi,

I need a memory efficient int-int dict in Python that would support the following operations in O(log n) time:

d[k] = v  # replace if present
v = d[k]  # None or a negative number if not present

I need to hold ~250M pairs, so it really has to be tight.

Do you happen to know a suitable implementation (Python 2.7)?

EDIT Removed impossible requirement and other nonsense. Thanks, Craig and Kylotan!


To rephrase: here's a trivial int-int dictionary with 1M pairs:

>>> import random, sys
>>> from guppy import hpy
>>> h = hpy()
>>> h.setrelheap()
>>> d = {}
>>> for _ in xrange(1000000):
...     d[random.randint(0, sys.maxint)] = random.randint(0, sys.maxint)
... 
>>> h.heap()
Partition of a set of 1999530 objects. Total size = 49161112 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      1   0 25165960  51  25165960  51 dict (no owner)
     1 1999521 100 23994252  49  49160212 100 int

On average, a pair of integers uses 49 bytes.

Here's an array of 2M integers:

>>> import array, random, sys
>>> from guppy import hpy
>>> h = hpy()
>>> h.setrelheap()
>>> a = array.array('i')
>>> for _ in xrange(2000000):
...     a.append(random.randint(0, sys.maxint))
... 
>>> h.heap()
Partition of a set of 14 objects. Total size = 8001108 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      1   7  8000028 100   8000028 100 array.array

On average, a pair of integers uses 8 bytes.

I accept that 8 bytes/pair in a dictionary is rather hard to achieve in general. Rephrased question: is there a memory-efficient implementation of int-int dictionary that uses considerably less than 49 bytes/pair?
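One direction worth sketching (not an implementation from this thread; the class name `IntIntDict` and the `-1` not-present sentinel are my own choices): keep keys and values in two parallel sorted `array.array('i')` arrays and binary-search with `bisect`. With 4-byte ints that is ~8 bytes/pair, lookups are O(log n), but inserting a *new* key is O(n) because of the shift, so it only suits build-mostly / query-many workloads:

```python
import array
from bisect import bisect_left

class IntIntDict(object):
    """Sorted parallel arrays: ~8 bytes/pair with 4-byte ints.
    Lookup is O(log n); inserting a new key is O(n) due to the
    element shift, so this suits build-once / query-many use."""

    def __init__(self):
        self._keys = array.array('i')
        self._values = array.array('i')

    def __setitem__(self, key, value):
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value          # replace if present
        else:
            self._keys.insert(i, key)        # O(n) shift
            self._values.insert(i, value)

    def __getitem__(self, key):
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return -1                            # sentinel: not present

d = IntIntDict()
d[10] = 100
d[3] = 33
d[10] = 101                # replaces the earlier value
print(d[3], d[10], d[7])   # -> 33 101 -1
```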

+2  A: 

You could use the IIBTree from Zope.

gnibbler
+4  A: 

8 bytes per key/value pair would be pretty difficult under any implementation, Python or otherwise. If you don't have a guarantee that the keys are contiguous, then an array representation wastes a lot of space between the keys (and needs some sort of dead value to mark an empty slot), while maintaining a separate index of key/value pairs by definition exceeds 8 bytes per pair, even if only by a small amount.

I suggest you go with your array method, though the best approach will depend on the nature of the keys, I expect.
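To illustrate the "dead value" point above, here is a toy open-addressing table in flat arrays (my own sketch, not from this answer: fixed capacity, no resizing, and it assumes keys are non-negative so that -1 can serve as the empty-slot marker). Note this gives O(1) average lookups rather than the O(log n) asked for; it only shows the space/sentinel trade-off:

```python
import array

EMPTY = -1                      # sentinel key marking an unused slot,
                                # so real keys must be non-negative
CAPACITY = 1024                 # fixed size; a real table would resize

keys = array.array('i', [EMPTY]) * CAPACITY
values = array.array('i', [0]) * CAPACITY

def put(k, v):
    i = k % CAPACITY
    while keys[i] != EMPTY and keys[i] != k:
        i = (i + 1) % CAPACITY  # linear probing past collisions
    keys[i] = k
    values[i] = v

def get(k):
    i = k % CAPACITY
    while keys[i] != EMPTY:
        if keys[i] == k:
            return values[i]
        i = (i + 1) % CAPACITY
    return -1                   # not present

put(7, 70)
put(7 + CAPACITY, 71)          # collides with key 7, probes to slot 8
print(get(7), get(7 + CAPACITY), get(8))   # -> 70 71 -1
```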

Kylotan
+4  A: 

I don't know whether this is a one-shot job or part of an ongoing project, but if it's the former, is throwing more RAM at it cheaper than the developer time needed to optimize the memory usage? Even at 64 bytes per pair, you're still only looking at 15GB, which would fit easily enough into most desktop boxes.

I think the correct answer probably lies within the SciPy/NumPy libraries, but I'm not familiar enough with the library to tell you exactly where to look.

http://docs.scipy.org/doc/numpy/reference/

You might also find some useful ideas in this thread: http://stackoverflow.com/questions/327223/memory-efficient-alternatives-to-python-dictionaries

Paul McMillan
I agree. I also think that NumPy is among the most likely routes to a memory-efficient numerical solution.
extraneon