views: 2347
answers: 6

Python allocates integers automatically based on the underlying system architecture. Unfortunately, I have a huge dataset that needs to be loaded fully into memory.

So, is there a way to force Python to use only 2 bytes for some integers (the equivalent of a C++ 'short')?

+19  A: 

Nope. But you can use short integers in arrays:

from array import array
a = array("h") # h = signed short, H = unsigned short

As long as a value stays in that array, it is stored as a C short; it only becomes a full Python int object when you read it back out.
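
For example (a rough sketch; the exact itemsize is platform-dependent, though typically 2 for 'h'):

from array import array
a = array("h", [1, 2, 3])       # three signed shorts
a.append(30000)                 # values must fit in the 16-bit range
print(a.itemsize)               # bytes per element, typically 2
print(len(a) * a.itemsize)      # total payload in bytes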

Armin Ronacher
A better and more complete answer than my own. :)
Nick Johnson
So, is an array('h') with only one element the same as creating a short integer?
Arnav
@Arnav: Nope. That would be a PyObject + a short integer.
Armin Ronacher
+2  A: 

Thanks to Armin for pointing out the 'array' module. I also found the 'struct' module, which packs C-style structs into a string:

From the documentation (http://docs.python.org/lib/module-struct.html):

>>> from struct import *
>>> pack('hhl', 1, 2, 3)
'\x00\x01\x00\x02\x00\x00\x00\x03'
>>> unpack('hhl', '\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)
>>> calcsize('hhl')
8
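
Note that the output above is what the documentation shows for a big-endian machine; with native format characters like 'hhl', the byte order, item sizes, and padding all depend on the platform. A rough sketch of making the result platform-independent, using an explicit byte-order prefix:

>>> pack('>hhl', 1, 2, 3)   # '>' = big-endian with standard sizes, no padding
'\x00\x01\x00\x02\x00\x00\x00\x03'
>>> calcsize('>hhl')        # 2 + 2 + 4 bytes
8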
Arnav
+2  A: 

Armin's suggestion of the array module is probably best. Two possible alternatives:

  • You can create an extension module yourself that provides the data structure that you're after. If it's really just something like a collection of shorts, then that's pretty simple to do.
  • You can cheat and manipulate bits so that you're storing one number in the lower half of a Python int and another in the upper half. You'd write some utility functions to convert to/from these within your data structure. Ugly, but it can be made to work (see the sketch just below).
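
A minimal sketch of that bit-packing trick (the names are illustrative; it assumes unsigned 16-bit values):

def pack_pair(lo, hi):
    # store two unsigned shorts in one Python int
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def unpack_pair(packed):
    # recover the two shorts
    return packed & 0xFFFF, (packed >> 16) & 0xFFFF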

It's also worth realising that a Python integer object is not just 4 bytes; every object carries extra overhead, such as a reference count and a type pointer. So if you have a really large number of shorts, you can save more than two bytes per number by using a C short in some way (e.g. the array module).

I had to keep a large set of integers in memory a while ago, and a dictionary with integer keys and values was too large (I had 1GB available for the data structure, IIRC). I switched to using an IIBTree (from ZODB) and managed to fit it. (The ints in an IIBTree are real C ints, not Python integers, and I hacked up an automatic switch to an IOBTree when a number was larger than 32 bits.)

Tony Meyer
Can I use IIBTree without installing all of Zope? Where do I get it? What's an IOBTree?
Greg
Just install ZODB (http://pypi.python.org/pypi/ZODB3/3.8.0). An IOBTree is a BTree that has integer keys (the I) and object values (the O).
Tony Meyer
A: 

@Armin: how come? The Python documentation says the minimum size for the short type codes is 2 bytes, and:

The actual representation of values is determined by the machine architecture (strictly speaking, by the C implementation). The actual size can be accessed through the itemsize attribute.

@Arnav: I suggest that your code check the itemsize of each type code and choose one that is actually 2 bytes on the underlying system.
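
A quick sketch of that check (assuming you want the signed type codes whose items are exactly 2 bytes on this platform):

from array import array
# keep only the candidate type codes that really are 2 bytes here
two_byte_codes = [c for c in ('h', 'i', 'l') if array(c).itemsize == 2]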

Martin
+2  A: 

If you're doing any sort of manipulation of this huge dataset, you'll probably want to use Numpy, which has support for a wide variety of numeric types and efficient operations on arrays of them.
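
For instance (a rough sketch; numpy.int16 is the 2-byte signed type here):

import numpy
a = numpy.zeros(1000000, dtype=numpy.int16)  # a million 16-bit ints
a[0] = 1234
print(a.itemsize)   # 2 bytes per element
print(a.nbytes)     # 2000000 bytes for the whole array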

giltay
A: 

@Arnav:

I'm confused by your solution. Are you saying dict2 will use less actual memory than dict1 in my little example below?

>>> import struct
>>> dict1 = {}
>>> for i in range(10000):
...     dict1[i] = i
...
>>> dict2 = {}
>>> for i in range(10000):
...     dict2[struct.pack('H', i)] = struct.pack('H', i)
...

Does anyone else know?
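
One way to probe this (a sketch; sys.getsizeof needs Python 2.6+, and it measures only the dict's own table, not the keys and values it references, which are separate objects):

import sys
# compare the size of the two dict structures themselves
print(sys.getsizeof(dict1))
print(sys.getsizeof(dict2))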

Greg