EDIT: The size of the wordlist is 10-20 times bigger than I wrote down; I simply forgot a zero.

EDIT2: I will have a look into SVDLIBC and also see how to reduce the matrix to a sparse representation, so that might help too.

I have generated a huge CSV file as output from my POS tagging and stemming. It looks like this:

        word1, word2, word3, ..., word150000
person1   1      2      0            1
person2   0      0      1            0
...
person650

It contains the word counts for each person, which gives me a characteristic vector for each person.

I want to run an SVD on this beast, but it seems the matrix is too big to be held in memory for the operation. My question is:

  • Should I reduce the number of columns by removing words with a column sum of, for example, 1, i.e. words that have been used only once? Do I bias the data too much with this approach? (See the sketch below.)

  • I tried the RapidMiner approach of loading the CSV into a database and then reading it back sequentially in batches for processing, as RapidMiner proposes. But MySQL can't store that many columns in a table. If I transpose the data and then re-transpose it on import, that also takes ages....

--> So in general I am asking for advice on how to perform an SVD on such a corpus.
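For concreteness, here is a minimal sketch (Python/SciPy rather than the Ruby/LAPACK setup mentioned in the comments) of reading such a count CSV into a sparse matrix and pruning the singleton words from the first bullet. The file name "counts.csv" and the layout (header row of words, first column of person labels) are assumptions:

    # Sketch: load the person-by-word counts sparsely and drop words
    # whose column sum is <= 1. File name and CSV layout are assumed.
    import csv
    import numpy as np
    from scipy.sparse import csr_matrix

    rows, cols, vals = [], [], []
    with open("counts.csv") as f:            # hypothetical file name
        reader = csv.reader(f)
        words = next(reader)[1:]             # header: word1, word2, ...
        persons = []
        for i, line in enumerate(reader):
            persons.append(line[0])
            for j, cell in enumerate(line[1:]):
                c = int(cell)
                if c != 0:                   # store only non-zero counts
                    rows.append(i)
                    cols.append(j)
                    vals.append(c)

    A = csr_matrix((vals, (rows, cols)), shape=(len(persons), len(words)))

    # Drop words used only once (column sum <= 1) to shrink the matrix.
    col_sums = np.asarray(A.sum(axis=0)).ravel()
    A_pruned = A[:, col_sums > 1]
    print(A.shape, "->", A_pruned.shape)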

+1  A: 

This is a big dense matrix. However, it is only a small sparse matrix.

Using a sparse-matrix SVD algorithm is enough, e.g. here.
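For illustration, a minimal sketch of a truncated sparse SVD, assuming SciPy's svds (an ARPACK wrapper); the library behind the link may differ:

    # Sketch: truncated SVD of a sparse matrix; only the non-zero
    # entries and the k singular triplets are kept in memory.
    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    # Stand-in for the 650 x 150,000 count matrix (~1% non-zeros).
    A = sparse_random(650, 150_000, density=0.01, format="csr", random_state=0)

    k = 100                            # number of singular triplets to keep
    U, s, Vt = svds(A, k=k)

    # svds does not guarantee an ordering; sort descending by singular value.
    order = np.argsort(s)[::-1]
    U, s, Vt = U[:, order], s[order], Vt[order, :]
    print(U.shape, s.shape, Vt.shape)  # (650, 100) (100,) (100, 150000)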

Yin Zhu
is it big and dense or small and sparse?
el chief
@el chief: I mean the matrix is stored in dense form, i.e. as a 2D array, which costs a lot of memory. However, I think the non-zero entries of the matrix can safely be stored in memory, so a sparse SVD algorithm could be applied to it.
Yin Zhu
Yeah, he could definitely exploit sparse SVD algorithms if he only needs a few singular value/vector pairs. However, I really don't understand why this is necessary; the described matrix is not very big at all.
SplittingField
@SF: You are right! This matrix is not big in its dense format either.
Yin Zhu
A: 

SVD is constrained by your memory size. See:

Folding In: a paper on partial matrix updates.

Apache Mahout is a distributed data-mining library that runs on Hadoop and has a parallel SVD.
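To sketch the folding-in idea (a standard LSI-style partial update, offered as an illustration rather than the paper's exact method): given a truncated SVD A ≈ U_k S_k V_k^T, a new row (person) can be projected into the existing k-dimensional space without recomputing the factorization.

    # Sketch: fold a new person's word-count row into an existing
    # truncated SVD A ~ U_k @ diag(s_k) @ Vt_k. Factors here are random
    # stand-ins; in practice they come from the SVD of the corpus.
    import numpy as np

    def fold_in_row(d, Vt_k, s_k):
        """Project a 1 x n_words count vector into the k-dim latent space."""
        return (Vt_k @ d) / s_k        # = d @ V_k @ inv(diag(s_k))

    rng = np.random.default_rng(0)
    k, n_words = 100, 150_000
    Vt_k = rng.standard_normal((k, n_words))
    s_k = rng.random(k) + 1.0
    d_new = rng.integers(0, 3, size=n_words).astype(float)

    print(fold_in_row(d_new, Vt_k, s_k).shape)   # (100,)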

Steve
As described, though, this really is not a big matrix, so I don't fully understand why the poster is getting into trouble...
SplittingField
I have checked my matrix again: the dimensions are 650 * 150,000, so I forgot a 0 :). The implementation I am using is a Ruby wrapper around LAPACK, so maybe somewhere along the way I am getting that error. I also tried working on it with RapidMiner, which allows an SVD on a matrix, but it seems to have memory errors too. In terms of word count, I was wondering about simply dropping all the words that occur only once; that would greatly reduce the dimension of the matrix. Anyway, thank you for your help, I will have a look into SVDLIBC.
plotti
@plotti: 650 by 150,000 is still not very big. As a single array of doubles, this requires around 650 * 150,000 * 8 / (1024 * 1024) ≈ 744 MB. This should still fit into memory (it does on my laptop). LAPACK can easily handle matrices of this size directly; however, I am not certain how the Ruby wrapper works. If you provide some more information above, I can better help determine which algorithms you should be looking at.
SplittingField