ansaurus

Question

Handling large dense matrices in python.

Answer 1

A:

You might find some advice in the NumPy (see SciPy) documentation (arrays/matrices):

Bruno 2010-07-10 09:52:20

Nope - Fortunately, Numpy has the same array size limitations as I've already described.

Fake Name 2010-07-10 10:11:23

Answer 2

+1 A:

For the 20,000 x 20,000 you are looking at 12GB of RAM?

Aren't you going to end up in swap hell trying to work with a 12GB in win32 which artificially limits the memory that the OS can address?

I'd be looking for an OS that can support the 12GB (32 bin win 2003 server can if you need to stick with 32bit windows), but a 64 bit machine with 64bit OS and 16GB of RAM would seem a better fit.

Good excuse for an upgrade :)

64 bit numpy can support your matrix

Python 2.5.2 (r252:60911, Jan 20 2010, 23:14:04) 
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.zeros((20000,20000),dtype=np.uint16)
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint16)

gnibbler 2010-07-10 10:52:21

You don't end up in swap-hell, python just dies, either with a memory error, or silently. I'm asking because I can't believe no one has had a similar issue, and tried to solve it on 32 bit.

Fake Name 2010-07-10 12:29:02

@Fake Name, swap hell will happen if you try to use 12GB of data with a kernel that is limited to 2 or 3GB of RAM whichever way you try to do it. Most likely noone has tried to solve it on 32 bit is because it is much more practical to use a 64 bit machine with sufficient RAM

gnibbler 2010-07-10 12:50:59

Answer 3

A:

You can reduce the memory use by using uint8, but be careful to avoid overflow errors. A uint16 requires two bytes, so the minimal memory requirement in your example is 8000*8000*30*2 bytes = 3.84 Gb.

If the second example fails then you need a new machine. The memory requirement is 20000*20000*2*bytes =800 Mb.

My advice is that you try to create smaller matrices and use "top", "ps v" or the gnome system monitor to check the memory used by your python proces. Start out examining a single thread with a small matrix and do the math. Note that you can release the memory of a variable x by writing del(x). This is useful for testing.

What is the memory on your machine? How much memory does pytables use to create a 20000*20000 table? How much memory does numpy use to create a 20000*20000 table using uint8?

nielsle 2010-07-10 11:44:05

It's an internal numpy error, not a (real) memory error.

Fake Name 2010-07-10 12:16:24

Also, you can't use a uint8 as a reference to a position in a table with 20000 rows, so the choice of uint16 was carefully calculated.

Fake Name 2010-07-10 12:17:07

The problems might be caused by a 32 bit windows limitation:"First, the 32-bit Python interpreter can only use 2 GB of your memory. Second, numpy arrays need a contiguous, un-fragmented, range of memory to be available within that 2 GB address space." (Quoted from http://old.nabble.com/Summation-of-large-float32-float64-arrays-td28638607.html )I think that you can force numpy to use non-contiguous (c-style) arrays, but that will not remove the 2Gb memory limit.

nielsle 2010-07-10 13:09:12

Perhaps the solution is to stop using windows 32. You may wish to have a look at the flag named IMAGE_FILE_LARGE_ADDRESS_AWARE, but I don't know if that is safe to use. Here is a description of Microsoft memory limits:http://msdn.microsoft.com/en-us/library/aa366778(VS.85).aspx

nielsle 2010-07-10 13:30:48

Answer 4

+1 A:

PyTables can handle tables of arbitrary size (Millions of columns!), through using memmap and some clever compression.

Ostensibly, it provides SQL like performance, to python. It will, however, require significant code modifications.

I'm not going to accept this answer until I've done a more thorough vetting, to ensure it can actually do what I want. Or someone provides a better solution.

Fake Name 2010-07-11 01:57:44

Answer 5

A:

If you have N objects, kept in a list L, and you wish to store the similarity betwen each object and each other object, that's O(N**2) similarities. Under the common conditions that similarity(A, B) == similarity(B, A) and similarity(A, A) == 0, all that you need is a triangular array S of similarities. The number of elements in that array will be N*(N-1)//2. You should be able to use an array.array for this purpose. Keeping your similarity as a float will take only 8 bytes. If you can represent your similarity as an integer in range(256), you use an unsigned byte as the array.array element.

So that's about 8000 * 8000 / 2 * 8 i.e. about 256 MB. Using only a byte for the similarity means only 32 MB. You could avoid the slow S[i*N-i*(i+1)//2+j] index calculation of the triangle thingie by simulating a square array instead using S[i*N+j]`; memory would double to (512 MB for float, 64 MB for byte).

If the above doesn't suit you, then perhaps you could explain """Each item [in which container?] is a custom class, and stores a pointer to the other class and a number representing it's "closeness" to that class."" and """I don't want to store numbers, I want to store reference to classes""". Even after replacing "class(es)" by "object(s)", I'm struggling to understand what you mean.

John Machin 2010-07-11 05:10:53

Answer 6

A:

Well, I've found my solution:
h5py

It's a library that basically presents a numpy-like interface, but uses compressed memmapped files to store arrays of arbitrary size (It's basically a wrapper for HDF5).

PyTables is built on it, and PyTables actually led me to it. However, I do not need any of the SQL functionality that is the main offering of PyTables, and PyTables does not provide the clean array-like interface I was really looking for.

h5py basically acts like a numpy array, and simply stores the data in a different format.

It also seems to have no limit to array size, except perhaps disk space. I'm currently doing testing on a 100,000 * 100,000 array of uint16.

Fake Name 2010-07-23 02:47:56

ansaurus

tags:

views:

answers:

Handling large dense matrices in python.

related questions