views: 545
answers: 9

How can I store a 100K x 100K matrix in Java?

I can't do this with a normal array declaration, as it throws a java.lang.OutOfMemoryError.
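
That is, something like the following fails:

double[][] matrix = new double[100000][100000]; // java.lang.OutOfMemoryError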

+2  A: 

You should try an "external" package to handle matrices. I've never done that myself, but maybe something like JAMA would work.

Soufiane Hassou
+1 for suggesting an external package, -1 for suggesting Jama.
mobrule
+6  A: 

100,000 x 100,000 = 10,000,000,000 (10 billion) entries. Even if you're storing single-byte entries, that's still in the vicinity of 10 GB. Does your machine even have that much physical memory, let alone a willingness to allocate that much to a single process?

Chances are you're going to need some way of keeping only part of the matrix in memory at any given time, with the rest buffered on disk.
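
To make the arithmetic concrete, here is a quick back-of-envelope check (an illustrative sketch, not part of the original answer; the class name is made up):

public class MatrixMemoryCheck {
    public static void main(String[] args) {
        long entries = 100000L * 100000L;  // 10,000,000,000 entries
        System.out.println("1-byte entries: " + entries + " bytes (~10 GB)");
        System.out.println("double entries: " + (entries * 8) + " bytes (~80 GB)");
        // versus what the JVM is actually prepared to give this process:
        System.out.println("JVM max heap:   " + Runtime.getRuntime().maxMemory() + " bytes");
    }
}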

Amber
Is there a way to keep only part of the matrix in memory at a given time, with the rest buffered on disk?
Deepak Konidena
@Deepak Sure: save the matrix to a file or database, and only use the range of the matrix you are currently "near". For example, if this were a map in a game, you could reasonably keep only the immediate surrounding area in memory and load more when you need it.
seanmonstar
You'd think that Java would have cottoned on to the fact that most OSes support virtual memory by now.
Pete Kirkham
@Pete - I'm sure that the Java designers are well aware of this. But what OSes do NOT do is allow user-level code to hook into the OS virtual memory system and implement application-specific paging.
Stephen C
@Stephen You don't need 'application-specific' paging to use virtual memory, even with a garbage collector. It can help with some of the larger-but-not-full collection cycles, but it's not essential.
Pete Kirkham
@Pete - so what DID you mean by your previous comment then ... in the context of representing an 80 GB array??
Stephen C
@Stephen Sun's JRE (at least up to 6) does not allow you to set the maximum memory size, on either Windows or Linux, to take advantage of virtual memory; instead it limits you to the physical memory size. Although GC does interact less well with paging than other schemes, and some experimental GC systems extend Linux's virtual memory API to let the application/user level interact with it in more detail to mitigate this, it would be nicer to be allowed to choose virtual memory and decide for ourselves, rather than being limited to physical memory only.
Pete Kirkham
@Pete Kirkham - I don't believe your last comment is true. Can you please provide evidence? My experience is that you **can** set the heap size greater than the amount of physical memory, provided you don't exceed the available virtual memory and/or the addressability constraints. (The latter refers to the fact that the hardware and OS place limits on the amount of virtual memory that a single user-space process can address. And the JVM uses some of this for non-heap purposes.)
Stephen C
@Stephen C I've tried previously on 32-bit Linux and 32-bit Windows and failed; I don't have access to suitable machines at the moment (nor plans to install an OS or remove memory to try again). I just tried on a 64-bit Linux machine and it was happy to take values larger than physical memory. The results for one 32-bit Linux machine are at http://stackoverflow.com/questions/461260/is-there-a-maximum-number-you-can-set-xmx-to-when-trying-to-increase-jvm-memory , though that has no swap.
Pete Kirkham
+14  A: 

The Colt library has a sparse matrix implementation for Java.
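
Usage looks roughly like the sketch below (this is my reading of the Colt API, so check the current docs; the class name ColtSparseExample is made up). One caveat, as far as I can tell: Colt's 2D matrices address cells with an int index, so rows * columns must stay below Integer.MAX_VALUE, which a full 100K x 100K matrix exceeds.

import cern.colt.matrix.impl.SparseDoubleMatrix2D;

public class ColtSparseExample {
    public static void main(String[] args) {
        // 40,000 x 40,000 = 1.6 billion cells: sparse, and within the int-index limit.
        SparseDoubleMatrix2D m = new SparseDoubleMatrix2D(40000, 40000);
        m.set(1000, 1001, 42.0);                  // only non-zero cells consume memory
        System.out.println(m.get(1000, 1001));    // 42.0
        System.out.println(m.cardinality());      // number of non-zero cells: 1
    }
}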

You could alternatively use Berkeley DB as your storage engine.

Now, if your machine has enough actual RAM (at least 9 gigabytes free), you can increase the heap size on the Java command line.
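
For example (illustrative values, and MyMatrixApp is a placeholder for your main class; -Xms sets the initial heap size and -Xmx the maximum):

java -Xms2g -Xmx12g MyMatrixApp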

jspcal
"increase the stack size" ... you mean hep size I think
Stephen C
yeah the initial/max heap size with -Xms/-Xmx
jspcal
I thought Java could only handle memory in chunks... that is, it can trace a lot of objects, but each one has to be 2 GB or less. Am I wrong?
lmsasu
Because colt's homepage is somewhat tricky to find: http://acs.lbl.gov/~hoschek/colt/
Andreas_D
@lmsasu Java doesn't have real 2D arrays, so it would be an array of 100,000 arrays of 100,000 doubles, not a single multi-gigabyte object
Pete Kirkham
+3  A: 

Well, I'd suggest that you increase the memory in your JVM, but you're going to need a lot of it, as you're talking about 10 billion items. It's (barely) possible with lots of memory or a clustered JVM, but that's probably the wrong answer.

  • You're getting the OutOfMemoryError because when you declare an array like int[1000], the memory is allocated immediately. (Also, doubles take up more space than ints, so an int representation will save you space.) Maybe you can substitute a more efficient implementation of your array (if you have many empty entries, look up "sparse matrix" representations).

  • You could store pieces in an outside system, like memcached or memory-mapped buffers.

There are lots of good suggestions here; maybe if you posted a more detailed description of the problem you're trying to solve, people could be more specific.

Steve B.
I am trying to represent the adjacency matrix of a graph with 100K nodes. Thanks.
Deepak Konidena
+10  A: 

If the vast majority of entries in your matrix will be zero (or even some other constant value) a sparse matrix will be suitable. Otherwise it might be possible to rewrite your algorithm so that the whole matrix doesn't exist simultaneously. You could produce and consume one row at a time, for example.
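
The row-at-a-time idea might look like the following sketch, where produceRow and consumeRow are hypothetical stand-ins for whatever your algorithm actually does:

public class RowStreaming {
    static final int N = 100000;

    public static void main(String[] args) {
        double[] row = new double[N];  // a single row of doubles is only ~800 KB
        for (int i = 0; i < N; i++) {
            produceRow(i, row);  // fill the buffer with row i
            consumeRow(i, row);  // use it; the next iteration reuses the buffer
        }
    }

    static void produceRow(int i, double[] row) { /* hypothetical producer */ }
    static void consumeRow(int i, double[] row) { /* hypothetical consumer */ }
}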

Greg Ball
Hey, it is in fact a sparse matrix with only a few non-zero entries. Thanks.
Deepak Konidena
+4  A: 

You can upgrade to this machine:

http://www.azulsystems.com/products/compute_appliance.htm

864 processor cores and 768 GB of memory; it only costs about as much as a single-family house somewhere.

irreputable
20 of these http://www.nvidia.com/object/product_tesla_C2050_C2070_us.html might do too
Pete Kirkham
+5  A: 

There are a number of possible solutions, depending on how much memory you have, how sparse the array actually is, and what the access patterns are going to be.

If 100K * 100K * 8 bytes (roughly 80 GB) is less than the amount of physical memory on your machine that is available to the JVM, a simple non-sparse array is a viable solution.

If the array is sparse, with (say) 75% or more of the elements being zero, then you can save space by using a sparse array library. Various alternatives have been suggested, but in all cases you still need to work out whether this will give you enough savings. Figure out how many non-zero elements there are going to be, multiply that by 8 (the size of a double), and then by (say) 4 to account for the overheads of the sparse representation. For instance, with 1% of the elements non-zero, that is 10^8 * 8 * 4 ≈ 3.2 GB. If that is less than the amount of physical memory that you can make available to the JVM, then sparse arrays are a viable solution.

If sparse and non-sparse arrays (in memory) won't work, things will get more complicated, and the viability of any solution will depend on the access patterns for the array data.

  • One approach is to represent the array as a file that is mapped into memory in the form of a MappedByteBuffer (there is a sketch of this after the list). Assuming that you don't have enough physical memory to store the entire file in memory, you are going to be hitting the virtual memory system hard, so it is best if your algorithm only needs to operate on contiguous sections of the array at any time. Otherwise, you'll probably die from swapping.

  • A second approach is a variation of the first. Map the array/file a section at a time, and when you are done, unmap and move to the next section. This only works if the algorithm works on the array in sections.

  • A third approach is to represent the array using a light-weight database like BDB. This will be slower than any in-memory solution because reading array elements will translate into disc accesses. But if you get it wrong it won't kill the system like the memory mapped approach will. (And if you do this on Linux/Unix, the system's disc block cache may speed things up, depending on your algorithm's array access patterns)

  • A fourth approach is to use a distributed memory cache. This replaces disc i/o with network i/o, and it is hard to say whether this is a good or bad thing.

  • A fifth approach is to analyze your algorithm and see if it is amenable to implementing as a distributed algorithm; e.g. with sections of the array and corresponding parts of the algorithm on different machines.
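
A minimal sketch of the memory-mapped approach from the first bullet (the file name, window position, and window size are illustrative; a single mapping must be under 2 GB, which is why the 80 GB file is mapped one window at a time):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedMatrixWindow {
    static final int N = 100000;                  // columns
    static final long ROW_BYTES = (long) N * 8;   // one row of doubles

    public static void main(String[] args) throws Exception {
        RandomAccessFile file = new RandomAccessFile("matrix.dat", "rw");
        FileChannel channel = file.getChannel();
        // Map a 100-row window starting at row 5000; the OS pages it in on demand.
        MappedByteBuffer window = channel.map(
                FileChannel.MapMode.READ_WRITE, 5000 * ROW_BYTES, 100 * ROW_BYTES);
        double x = window.getDouble(42 * 8);      // element (5000, 42)
        window.putDouble(42 * 8, x + 1.0);
        channel.close();
        file.close();
    }
}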

Stephen C
doubles use 8 bytes rather than 4.
Peter Lawrey
Duh ... I knew that!
Stephen C
+2  A: 

Unless you have 100K x 100K x 8 bytes (about 80 GB) of memory, you cannot create this matrix in memory. You can create it on disk instead and access it using memory mapping, but that approach will be very slow.

What are you trying to do? You may find that representing your data in a different way will be much more efficient.

Peter Lawrey
+3  A: 

Sounds like you need a sparse matrix. Others have already suggested good third-party implementations that may suit your needs...

Depending on your application, you could get away without a third-party matrix library by just using a Map as a backing store for your matrix data. Kind of...

import java.util.Map;
import java.util.TreeMap;

public class SparseMatrix<T> {
    private T defaultValue;
    private int m;
    private int n;
    // Only non-default entries are stored. Keys are flattened (row, col)
    // indices; they must be long because i * n + j overflows int for a
    // 100K x 100K matrix (10 billion cells vs. Integer.MAX_VALUE of ~2.1 billion).
    private Map<Long, T> data = new TreeMap<Long, T>();

    /// create a new matrix with m rows and n columns
    public SparseMatrix(int m, int n, T defaultValue) {
        this.m = m;
        this.n = n;
        this.defaultValue = defaultValue;
    }

    /// set value at [i,j] (row, col)
    public void setValueAt(int i, int j, T value) {
        if (i >= m || j >= n || i < 0 || j < 0)
            throw new IllegalArgumentException(
                    "index (" + i + ", " + j + ") out of bounds");
        data.put((long) i * n + j, value);
    }

    /// retrieve value at [i,j] (row, col), falling back to the default
    public T getValueAt(int i, int j) {
        if (i >= m || j >= n || i < 0 || j < 0)
            throw new IllegalArgumentException(
                    "index (" + i + ", " + j + ") out of bounds");
        T value = data.get((long) i * n + j);
        return value != null ? value : defaultValue;
    }
}

A simple test case illustrating the use of SparseMatrix would be:

import junit.framework.TestCase;

public class SparseMatrixTest extends TestCase {
    public void testMatrix() {
        SparseMatrix<Float> matrix =
            new SparseMatrix<Float>(100000, 100000, 0.0F);
        matrix.setValueAt(1000, 1001, 42.0F);
        assertTrue(matrix.getValueAt(1000, 1001) == 42.0);
        assertTrue(matrix.getValueAt(1001, 1000) == 0.0);
    }
}

This is not the most efficient way of doing it, because every non-default entry in the matrix is stored as an Object. Depending on the number of actual values you are expecting, the simplicity of this approach might trump integrating a third-party solution (and possibly dealing with its license; again, depending on your situation).

Adding matrix operations like multiplication to the above SparseMatrix implementation should be straightforward (and is left as an exercise for the reader ;-)

VoidPointer