views:

1500

answers:

7

I am looking for Python implementation of k-means algorithm with examples to cluster and cache my database of coordinates.

+3  A: 

From wikipedia, you could use scipy, K-means clustering an vector quantization

Or, you could use a Python wrapper for OpenCV, ctypes-opencv.

Or you could OpenCV's new Python interface, and their kmeans implementation.

Jacob
A: 

You can also use GDAL, which has many many functions to work with spatial data.

George
+2  A: 

By the way, K-Means is a very simple algorithm to implement. It's an educational experience to implement it yourself.

ldog
Yes, I agree and I will definitely do that.
Eeyore
+6  A: 

Scipy's clustering implementations work well, and they include a k-means implementation. There's a scipy kmeans example here.

There's also scipy-cluster, which does agglomerative clustering; ths has the advantage that you don't need to decide on the number of clusters ahead of time.

tom10
+2  A: 

SciPy's kmeans2() has some numerical problems: others have reported error messages such as "Matrix is not positive definite - Cholesky decomposition cannot be computed" in version 0.6.0, and I just encountered the same in version 0.7.1.

For now, I would recommend using PyCluster instead. Example usage:

>>> import numpy
>>> import Pycluster
>>> points = numpy.vstack([numpy.random.multivariate_normal(mean, 
                                                            0.03 * numpy.diag([1,1]),
                                                            20) 
                           for mean in [(1, 1), (2, 4), (3, 2)]])
>>> labels, error, nfound = Pycluster.kcluster(points, 3)
>>> labels  # Cluster number for each point
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
>>> error   # The within-cluster sum of distances for the solution
1.7721661785401261
>>> nfound  # Number of times this solution was found
1
Vebjorn Ljosa
A: 

For continuous data, k-means is very easy.

You need a list of your means, and for each data point, find the mean its closest to and average the new data point to it. your means will represent the recent salient clusters of points in the input data.

I do the averaging continuously, so there is no need to have the old data to obtain the new average. Given the old average k,the next data point x, and a constant n which is the number of past data points to keep the average of, the new average is

k*(1-(1/n)) + n*(1/n)

Here is the full code in Python

from __future__ import division
from random import random

# init means and data to random values
# use real data in your code
means = [random() for i in range(10)]
data = [random() for i in range(1000)]

param = 0.01 # bigger numbers make the means change faster
# must be between 0 and 1

for x in data:
    closest_k = 0;
    smallest_error = 9999; # this should really be positive infinity
    for k in enumerate(means):
        error = abs(x-k[1])
        if error < smallest_error:
            smallest_error = error
            closest_k = k[0]
        means[closest_k] = means[closest_k]*(1-param) + x*(param)

you could just print the means when all the data has passed through, but its much more fun to watch it change in real time. I used this on frequency envelopes of 20ms bits of sound and after talking to it for a minute or two, it had consistent categories for the short 'a' vowel, the long 'o' vowel, and the 's' consonant. wierd!

Nathan
+1  A: 

Dear all,

By using SciPy's kmeans2() I obtained the mentioned error of ""Matrix is not positive definite - Cholesky decomposition cannot be computed". However, when using SciPy's kmeans() with the same points and number of clusters, the error doesn't appear. Does anybody has an idea which are the differences between kmeans vs. kmeans2?

BTW, Vebjorn Ljosa's solution consisted in using "Pycluster.kcluster". However, by haveing a look at the post, I was wondering where do the centroids are returned? => labels, error, nfound = Pycluster.kcluster(points, 3).

Best wishes

Javier