views:

379

answers:

2

Hey, folks.

So, I'm doing some Kmeans classification using numpy arrays that are quite sparse-- lots and lots of zeroes. I figured that I'd use scipy's 'sparse' package to reduce the storage overhead, but I'm a little confused about how to create arrays, not matrices.

I've gone through this tutorial on how to create sparse matrices: http://www.scipy.org/SciPy_Tutorial#head-c60163f2fd2bab79edd94be43682414f18b90df7

To mimic an array, I just create a 1xN matrix, but as you may guess, Asp.dot(Bsp) doesn't quite work because you can't multiply two 1xN matrices. I'd have to transpose each array to Nx1, and that's pretty lame, since I'd be doing it for every dot-product calculation.

Next up, I tried to create an NxN matrix where column 1 == row 1 (such that you can multiply two matrices and just take the top-left corner as the dot product), but that turned out to be really inefficient.

I'd love to use scipy's sparse package as a magic replacement for numpy's array(), but as yet, I'm not really sure what to do.

Any advice?

Thank you very much!

A: 

You could create a subclass of one of the existing 2d sparse arrays

from scipy.sparse import dok_matrix

class sparse1d(dok_matrix):
    def __init__(self, v):
        dok_matrix.__init__(self, (v,))
    def dot(self, other):
        return dok_matrix.dot(self, other.transpose())[0,0]

a=sparse1d((1,2,3))
b=sparse1d((4,5,6))
print a.dot(b)
gnibbler
Unfortunately, the issue with that is that you have to transpose the dang things on the fly, which doesn't make a lot of sense when you're doing millions of comparisons. I tried caching the dot products, but unfortunately, we don't do the same dot products very often, so that didn't help much.
spitzanator
A: 

I'm not sure that it is really much better or faster, but you could do this to avoid using the transpose:

Asp.multiply(Bsp).sum()

This just takes the element-by-element product of the two matrices and sums the products. You could make a subclass of whatever matrix format you are using that has the above statement as the dot product.

However, it is probably just easier to tranpose them:

Asp*Bsp.T

That doesn't seem like so much to do, but you could also make a subclass and modify the mul() method.

Justin Peel
I also tried, for a vector [1, 2, 3], creating a matrix:[1, 2, 3][2, 0, 0][3, 0, 0]Taking two of these and multiplying (in any order) gives the desired dot product in the top left of the result matrix. Unfortunately, this severely negatively-impacted speed.
spitzanator