Many clustering algorithms are available. A popular one is K-means, which takes a given number of clusters and iterates to find the best assignment of objects to those clusters.
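To make the iteration concrete, here is a minimal pure-Python sketch of the usual K-means loop (Lloyd's algorithm). The function name `kmeans`, its parameters, and the choice of random input points as starting centroids are assumptions made for illustration; real implementations add smarter initialisation and multiple restarts.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k randomly chosen input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # an empty cluster keeps its old centroid
        if new_centroids == centroids:
            break  # converged: centroids (and therefore assignments) stopped changing
        centroids = new_centroids
    return centroids, labels
```

Calling `kmeans(points, 2)` on a list of 2-D tuples returns the final centroids and one cluster label per point. Note that the number of clusters `k` has to be supplied by the caller, which is exactly what the questions below are about.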

What method do you use to determine the number of clusters in the data in k-means clustering?

Does any R package provide V-fold cross-validation for determining the right number of clusters?

Another widely used approach is the Expectation Maximization (EM) algorithm, which assigns each instance a probability distribution indicating how likely it is to belong to each of the clusters.
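As a rough illustration of what "a probability of belonging to each cluster" means, the sketch below runs EM for a one-dimensional Gaussian mixture in plain Python. The names `normal_pdf` and `em_gmm_1d`, the unit starting variances, and the requirement that the caller supplies initial means are all assumptions for this example, not part of any particular package.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, means, iterations=50):
    """EM for a 1-D Gaussian mixture; `means` is the initial guess, one entry per cluster."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k       # assumption: start every component with unit variance
    weights = [1.0 / k] * k     # assumption: start with equal mixing weights
    resp = []
    for _ in range(iterations):
        # E-step: resp[i][j] is the probability that xs[i] belongs to cluster j.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each component from its probability-weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor the variance so a component cannot collapse
    return weights, means, variances, resp
```

Each row of the returned `resp` sums to 1 and is the soft assignment for one data point, in contrast to the hard labels K-means produces.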

Is this algorithm implemented in R?

If it is, does it have an option to automatically select the optimum number of clusters by cross-validation?

Do you prefer some other clustering method instead?

+2  A: 

For large, sparse datasets I would seriously recommend the affinity propagation method. It has superior performance compared to K-means, and it is deterministic in nature.

http://www.psi.toronto.edu/affinitypropagation/ It was published in the journal Science.

However, the choice of the optimal clustering algorithm depends on the dataset under consideration. K-means is a textbook method, and it is very likely that someone has developed a better algorithm more suitable for your type of dataset.

This is a good tutorial by Prof. Andrew Moore (CMU, Google) on K-means and hierarchical clustering: http://www.autonlab.org/tutorials/kmeans.html

A: 

Last week I coded up such an estimate-the-number-of-clusters algorithm for a K-Means clustering program. I used the method outlined in:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.70.9687&rep=rep1&type=pdf

My biggest implementation problem was finding a suitable cluster validation index (i.e., an error metric) that would work. Now it is a matter of processing speed, but the results currently look reasonable.
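For illustration only, the general idea of picking the number of clusters with a validation index can be sketched as below. This is not the method from the linked paper: the mean silhouette width is swapped in as the index, and the helper names (`sq_dist`, `kmeans_labels`, `silhouette`, `choose_k`) and the deterministic farthest-first initialisation are assumptions made to keep the example short and repeatable.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_labels(points, k, iterations=100):
    """Compact K-means returning one cluster label per point."""
    # Assumption: farthest-first initialisation, chosen because it is deterministic.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette width; higher means tighter, better-separated clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0  # undefined for a single cluster; treated here as the worst score
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(sq_dist(p, q) ** 0.5 for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(sq_dist(p, q) ** 0.5 for q in other) / len(other)
            for key, other in clusters.items() if key != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def choose_k(points, candidates):
    """Return the candidate k whose clustering has the highest silhouette."""
    best_k, best_score = None, float("-inf")
    for k in candidates:
        score = silhouette(points, kmeans_labels(points, k))
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

A call such as `choose_k(points, range(2, 6))` clusters the data once per candidate and keeps the best-scoring `k`. The silhouette needs all pairwise distances, so it is quadratic in the number of points, which is one reason processing speed becomes the concern on larger datasets.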

winwaed