I'm doing some tests clustering a large number of very large sparse vectors representing the term frequency-inverse document frequency (TF-IDF) of various hypertextual documents. What algorithm would you suggest for clustering this data, taking into account the scale of the data set? The dimension of the vectors would be > 3·10^5 and the number of vectors could be around 10^9. I've taken a look at the DBSCAN and OPTICS algorithms. The number of clusters is not known a priori, and a spatial index with such high dimensionality seems complicated.

+2  A: 

I've had almost as good results with simple K-means clustering as with almost anything else, and it's definitely faster than most of the alternatives. I've also gotten good results with pair-wise agglomeration, but it's quite a bit slower. For K-means, you do have to start with some estimated number of clusters, but you can adjust it algorithmically as you go along: if you find two clusters with means that are too close together, reduce the number of clusters; if you find clusters with too large a range of variation, try more clusters. I've found sqrt(N) to be a reasonable starting point -- but I'm usually starting with more like 10^7 documents rather than 10^9. For 10^9, it might make sense to reduce that somewhat.

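A minimal sketch of that adjust-k loop, assuming a dense array `X` (rows = documents, e.g. after dimensionality reduction) and scikit-learn's MiniBatchKMeans as the clusterer; the thresholds `merge_dist` and `max_var` are hypothetical knobs, not values from the answer:

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import MiniBatchKMeans

def adaptive_kmeans(X, k, merge_dist=0.5, max_var=2.0, max_rounds=10):
    """Re-run K-means, shrinking k when two centers sit too close
    together and growing k when a cluster's spread is too large."""
    for _ in range(max_rounds):
        km = MiniBatchKMeans(n_clusters=k).fit(X)
        centers, labels = km.cluster_centers_, km.labels_
        closest = pdist(centers).min()  # nearest pair of cluster means
        widest = max(                   # largest within-cluster spread
            ((X[labels == c] - centers[c]) ** 2).sum(axis=1).mean()
            for c in range(k) if (labels == c).any()
        )
        if closest < merge_dist and k > 2:
            k -= 1   # two clusters are nearly coincident: merge
        elif widest > max_var:
            k += 1   # some cluster varies too widely: split
        else:
            break    # both criteria satisfied
    return km
```

Following the answer's rule of thumb, you might seed it with `adaptive_kmeans(X, k=int(np.sqrt(X.shape[0])))`, lowered somewhat for 10^9 documents.
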
If it were up to me, however, I'd think very hard about starting by reducing the dimensionality with something like Landmark MDS, then doing the clustering.

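The answer doesn't spell out Landmark MDS; the sketch below is one reading of the usual algorithm (de Silva & Tenenbaum): classical MDS on a small random set of landmark points, then a triangulation step that places every other point from its distances to those landmarks. It assumes a dense `X`; all parameter values are placeholders.

```python
import numpy as np
from scipy.spatial.distance import cdist

def landmark_mds(X, n_landmarks=100, n_dims=10, seed=0):
    """Embed rows of X into n_dims dimensions via Landmark MDS."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=n_landmarks, replace=False)
    landmarks = X[idx]

    # Classical MDS on the landmark-to-landmark squared distances.
    D2 = cdist(landmarks, landmarks, "sqeuclidean")
    J = np.eye(n_landmarks) - 1.0 / n_landmarks      # double-centering
    B = -0.5 * J @ D2 @ J
    evals, evecs = np.linalg.eigh(B)
    top = np.argsort(evals)[::-1][:n_dims]           # largest eigenvalues
    evals, evecs = evals[top].clip(min=1e-12), evecs[:, top]

    # Triangulation: place each point from its squared distances
    # to the landmarks (the out-of-sample formula of Landmark MDS).
    pseudo = evecs / np.sqrt(evals)
    d2 = cdist(X, landmarks, "sqeuclidean")
    return -0.5 * (d2 - D2.mean(axis=0)) @ pseudo
```
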
Jerry Coffin
K-Means should **always** be the first segmentation technique you try when clustering *anything*. It's simple and efficient and provides excellent results most of the time. The only downside is having to choose an appropriate value of K. You can always try an increasing sequence of K's, using the intercluster variance as a criterion for the quality of the clustering. In practice, however, this does not work all that well.
ldog
+1  A: 

Decision trees are popular for working efficiently in high-dimensional spaces. Check out Clustering Via Decision Tree Construction.

Also, Randomized Forests are extremely efficient learners and an OpenCV implementation exists if you want to play with it.

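If you want to stay in Python rather than OpenCV, one loosely related option (my substitution, not what's linked above) is scikit-learn's RandomTreesEmbedding, which uses a forest of completely random trees to re-encode the data before a standard clusterer:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.ensemble import RandomTreesEmbedding

# X: the sparse TF-IDF matrix from the question.
# Each document is re-encoded by the tree leaves it falls into,
# and the sparse leaf-indicator codes are then clustered.
codes = RandomTreesEmbedding(n_estimators=50, max_depth=5).fit_transform(X)
labels = MiniBatchKMeans(n_clusters=100).fit_predict(codes)
```
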
Jacob
+1  A: 

I hear that semantic hashing achieves excellent results. However, deep belief nets are pretty hard to implement. You might want to try min-hashing (that's a probabilistic approach, though) or locality-sensitive hashing for Euclidean spaces.

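A bare-bones min-hashing sketch, assuming each document has been reduced to the set of its term ids (every parameter below is an illustrative choice, not something from the answer):

```python
import random

PRIME = (1 << 61) - 1   # large prime for the universal hash family

def minhash(term_ids, num_hashes=64, seed=0):
    """MinHash signature: the fraction of matching components between
    two signatures estimates the Jaccard similarity of the term sets."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
              for _ in range(num_hashes)]
    return [min((a * t + b) % PRIME for t in term_ids) for a, b in coeffs]

def estimated_jaccard(sig_a, sig_b):
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Documents whose signatures agree on enough components can then be binned into candidate groups without a pairwise pass over all 10^9 vectors.
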
In general, clustering in such high-dimensional spaces is difficult due to the curse of dimensionality and the fact that most items have similar distances to each other. Standard approaches like K-means might work if you reduce the dimensionality beforehand via SOMs or PCA.

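One concrete reading of that reduce-then-cluster advice in Python: scikit-learn's PCA won't take a sparse matrix directly, so TruncatedSVD (the LSA trick, a common stand-in for PCA on TF-IDF data) is used here; the component and cluster counts are placeholders:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

# Project the >3*10^5-dimensional sparse TF-IDF vectors down to a few
# hundred dense dimensions, then run K-means in the reduced space.
reduce_and_cluster = make_pipeline(
    TruncatedSVD(n_components=200),
    MiniBatchKMeans(n_clusters=1000),
)
labels = reduce_and_cluster.fit_predict(X)   # X: sparse TF-IDF matrix
```
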
bayer
Thanks for the interesting links.
piotr
+1  A: 

When clustering data I would always try at least these two algorithms in this order:

  1. K-Means: try tweaking the results as much as possible. If you can get K-Means to work for you and provide decent results, you will almost surely not do better with any other algorithm.

  2. Expectation Maximization: the K-means algorithm was actually developed as a cheap and good alternative to the EM algorithm. The EM algorithm is more complex to understand and more expensive to compute, but its results are excellent. You can learn more about EM at http://en.wikipedia.org/wiki/Expectation-maximization_algorithm . There is an OpenCV implementation of EM: http://opencv.willowgarage.com/documentation/expectation-maximization.html (see the sketch at the end of this answer).

If the results of neither of these two are satisfactory, I would start looking elsewhere, but not until you have tried both.
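
The linked OpenCV implementation is C++; a Python equivalent (my substitution) is scikit-learn's GaussianMixture, which fits the same Gaussian-mixture model by EM. Full-covariance EM is intractable at 3·10^5 dimensions, so this assumes the data has already been reduced, e.g. to a dense `X_reduced`:

```python
from sklearn.mixture import GaussianMixture

# X_reduced: documents after dimensionality reduction,
# shape (n_samples, n_dims).
gmm = GaussianMixture(
    n_components=50,          # number of mixture components (clusters) to try
    covariance_type="diag",   # diagonal covariances keep EM tractable
)
labels = gmm.fit_predict(X_reduced)
# Unlike K-means, EM also gives soft assignments:
posteriors = gmm.predict_proba(X_reduced)    # shape (n_samples, 50)
```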

ldog
Isn't K-Means an instance of EM?
bayer
@bayer: No, they are most certainly not the same algorithm, if that's what you mean. K-means is non-parametric, but EM is parametric (meaning EM asserts that there is an underlying multivariate Gaussian distribution for the data, which is not a very stringent assumption if you consider the central limit theorem). From what I understand, the EM algorithm sometimes gets grouped as a meta-algorithm under which other algorithms fall. It can actually be implemented standalone, from what I've seen.
ldog