Basically, what is the best way to go about storing and using dense matrices in python?
I have a project that generates similarity metrics between every item in an array.
Each item is a custom class, and stores a pointer to the other class and a number representing it's "closeness" to that class.
Right now, it works brilliantly up to about ~8000 items, after which it fails with a out-of-memory error.
Basically, if you assume that each comparison uses ~30 (seems accurate based on testing) bytes to store the similarity, that means the total required memory is:
numItems^2 * itemSize = Memory
So the memory usage is exponential based on the number of items.
In my case, the memory size is ~30 bytes per link, so:
8000 * 8000 * 30 = 1,920,000,000 bytes, or 1.9 GB
which is right at the memory limit for a single thread.
It seems to me that there has to be a more effective way of doing this. I've looked at memmapping, but it's already computationally intensive just to generate the similarity values, and bottlenecking it all through a harddrive seems a little ridiculous.
Edit
I've looked at numpy and scipy. Unfortunatly, they don't support very large arrays either.
>>> np.zeros((20000,20000), dtype=np.uint16)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
MemoryError
>>>
Further Edit
Numpy seems to be popular. However, numpy won't really do what I want, at least without another abstraction layer.
I don't want to store numbers, I want to store reference to classes. Numpy supports objects, but that doesn't really address the array size issues. I brought up numpy just as an example of what isn't working.
Any advice?
Edit Well, I wound up just rewriting all the logic so it no longer stores any redundant values, reducing the memory useage from O*n^2
to O*((n*(n-1))/2)
.
Basically, this whole affair is a version of the handshake problem, so I've switched from storing all links to only a single version of each link.
It's not a complete solution, but I generally don't have any datasets large enough to overflow it, so I think it will work out. PyTables is really interesting, but I don't know any SQL, and there doesn't appear to be any nice traditional slicing or index based way to access the table data. I may revisit the issue in the future.