views:

143

answers:

4

Hi

at first sorry for my poor grammar.

I want to build a simple hierarchical clustering algorithm in Java, so i need to build a similarity matrix, whose ij entry gives the similarity between i and j clusters.

The first thought is using int[][] for storing this matrix (each cluster has an ID of type integer).

I think that having for example initially 5000 clusters will lead to program memory crash, so any ideas for storing in another way this matrix? Maybe in another data structure?

Thank you

+1  A: 

2000 x 2000 isn't a great deal of memory these days so you could just do

int[][] = new int[2000][2000];

If some entries don't have a similarity entry, then perhaps you could exploit the sparseness and save some memory, but unless you have space constraints I don't think it's worth the effort.

Jeff Foster
A: 

Will it really crash? The matrix will hold 5000x5000 = 25 million int values. I still think, the JVM can handle this. You may need another hashtable to map from cluster to array index value, but that's not too much aswell. Just increase memory, a 32bit JVM can use 2GB RAM, that's plenty enough.

If you really need to calculate the similarity for all clusters, then each cell in the matrix will have a value and I think, there's no better data structure for the result.

Andreas_D
+1  A: 

25 million ints take roughly 100Mb of memory.

adding -Xmx256m switch when executing java should be enough if yoou're going the int[][] route.

if you don't need the presition of int, go with short to cut memory to 50M.

if most values are 0, you should definitely google for a sparse matrix implementation.

edit: if similarity(i,j) always equals similarity(j,i) you could use this to shave off half as well.

Buhb
.. because a two-dimensional array doesn't have to be rectangular, each row can have it's own length so you can easily create a triangular data structure with length(row) = length(r-1)-1 (+1 for that hint!)
Andreas_D
A: 

Is it going to have a lot zeros? If so, you need store all zeros. There are standards for storing sparse matrices such as Compressed Row Storage format

http://www.cs.utk.edu/~dongarra/etemplates/node373.html

srini.venigalla