I wrote a k-means clustering algorithm in MATLAB, and I thought I'd try it against MATLAB's built-in kmeans(X,k).

However, for the very easy four-cluster setup (see picture), MATLAB's kmeans does not always converge to the optimal solution (left), but sometimes to the suboptimal one on the right.

The one I wrote does not always manage it either, but shouldn't the built-in function be able to solve such an easy problem, always finding the optimal solution?

[figure: scatter plots of the four-cluster dataset; left: the optimal clustering, right: a suboptimal result]

+2  A: 

The k-means algorithm is quite sensitive to the initial guess for the cluster centers. Did you try both codes with the same initial mass centers?

The algorithm is simple, and I doubt there is much variation between your implementation and MATLAB's.

Alexandre C.
I guess you are right. To get better convergence, I instead modified my algorithm to make several attempts with different initial mass centers, and then pick the best one by measuring the variance of the clusters. Low mean variance in an attempt equals compact clusters. How does that sound?
Theodor
@Theodor: This is one possible criterion indeed.
Alexandre C.
+4  A: 

As @Alexandre C. explained, the K-means algorithm depends on the initial cluster centroid positions, and there is no guarantee that it will converge to the optimal solution.

The best you can do is to repeat the experiment several times with random starting points.

MATLAB's implementation offers such an option: replicates, which repeats the clustering N times and picks the run with the lowest total within-cluster point-to-centroid distance. You also get to control how the initial centroids are chosen with the start option.

In addition, MATLAB provides a choice among a number of distance measures (Euclidean, Manhattan, cosine, ...). One neat option, emptyaction, lets you control what happens when a cluster loses all of its assigned members during the iterations.
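Putting those together, a call might look something like this (a minimal sketch; the option values are taken from the kmeans documentation, and the cluster count 4 matches the question's setup):

% Run the clustering 10 times from random sample starts and keep the
% best run; use city-block (Manhattan) distance, and re-seed any
% cluster that becomes empty as a new singleton cluster.
[idx, C] = kmeans(X, 4, 'replicates', 10, ...
                        'start', 'sample', ...
                        'distance', 'cityblock', ...
                        'emptyaction', 'singleton');

This automates the repeat-and-pick-best strategy discussed in the comments above.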

But the real advantage is that it employs a two-phase algorithm: the usual assign-recompute iterations, followed by an online update phase. Be sure to read the algorithm section of the documentation page for more information.

Amro
+1  A: 

You will probably often be disappointed by the solution that any particular run of "the k-means algorithm" (i.e. Lloyd's algorithm) comes up with. That's because Lloyd's algorithm often gets stuck in bad local minima.

Luckily, Lloyd's is just one way of solving k-means, and there's an approach that almost always finds a better local minimum.

The trick is to update the cluster assignments of the data points one at a time. You can do this efficiently by keeping a count n of the number of points assigned to each mean, so that you can recalculate the cluster mean m after removing a point x as follows:

m_new = (n * m - x) / (n - 1)

And add x to cluster mean m using:

m_new = (n * m + x) / (n + 1)

Of course, because it can't be vectorized, it's a bit painful to run in MATLAB, but it's not too bad in other languages.
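Still, here is a rough sketch of the loop (the variable names are illustrative: X is the N-by-d data, m the k-by-d matrix of means, n the per-cluster counts, and idx the current assignments):

% Sweep over the points one at a time; move a point to a strictly
% nearer mean and update the two affected means incrementally.
% Repeat sweeps until no point moves.
changed = true;
while changed
    changed = false;
    for i = 1:size(X, 1)
        x = X(i, :);
        old = idx(i);
        if n(old) <= 1, continue; end          % never empty a cluster
        d2 = sum(bsxfun(@minus, m, x).^2, 2);  % squared distances to all means
        [best, new] = min(d2);
        if best < d2(old)
            m(old, :) = (n(old) * m(old, :) - x) / (n(old) - 1);  % remove x
            n(old) = n(old) - 1;
            m(new, :) = (n(new) * m(new, :) + x) / (n(new) + 1);  % add x
            n(new) = n(new) + 1;
            idx(i) = new;
            changed = true;
        end
    end
end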

If you are really keen on getting the best possible local minimum, and you don't mind using exemplar-based clustering, you should look at affinity propagation. MATLAB implementations are available on the Frey lab affinity propagation page.

qdjm
A: 

Although k-means++ won't solve the problem in a single run, it tends to give better results when run N times, compared to running the original k-means algorithm N times.

rwong
Yeah, I read about the k-means++ algorithm on Wikipedia, but I didn't really understand how the initial clusters were initialized.
Theodor
Try this link: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
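In short: the first center is chosen uniformly at random, and each subsequent center is sampled with probability proportional to its squared distance from the nearest center already chosen. A rough sketch of the seeding step (the function name is just illustrative; pdist2 requires the Statistics Toolbox):

function C = kmeanspp_seed(X, k)
% k-means++ seeding (Arthur & Vassilvitskii, 2007).
% X is N-by-d data; returns a k-by-d matrix of initial centers.
N = size(X, 1);
C = X(randi(N), :);                    % first center: uniform at random
for j = 2:k
    D2 = min(pdist2(X, C).^2, [], 2);  % squared dist. to nearest chosen center
    p = cumsum(D2) / sum(D2);          % sample proportionally to D2
    C(j, :) = X(find(rand <= p, 1), :);
end
end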
rwong