I have a function that updates the centroid (mean) in a K-means algorithm. I ran a profiler and noticed that this function uses a lot of computing time.

It looks like:

def updateCentroid(self, label):
    # gather the coordinates of the cluster's points, then average them
    X = []; Y = []
    for point in self.clusters[label].points:
        X.append(point.x)
        Y.append(point.y)
    self.clusters[label].centroid.x = numpy.mean(X)
    self.clusters[label].centroid.y = numpy.mean(Y)

So I wonder: is there a more efficient way to calculate the mean of these points? If not, is there a more elegant way to formulate it? ;)

EDIT:

Thanks for all the great responses! I was thinking that perhaps I can calculate the mean cumulatively, using something like:

    x_bar(t) = ((t-1) * x_bar(t-1) + x(t)) / t

where x_bar(t) is the new mean and x_bar(t-1) is the old mean.

Which would result in a function similar to this:

def updateCentroid(self, label):
    cluster = self.clusters[label]
    n = len(cluster.points)
    cluster.centroid.x *= (n-1) / n
    cluster.centroid.x += cluster.points[n-1].x / n
    cluster.centroid.y *= (n-1) / n
    cluster.centroid.y += cluster.points[n-1].y / n

It's not really working, but do you think it could work with some tweaking?

A: 

Try this:

def updateCentroid(self, label):
    points = self.clusters[label].points
    self.clusters[label].centroid.x = numpy.array([point.x for point in points]).mean()
    self.clusters[label].centroid.y = numpy.array([point.y for point in points]).mean()
hugo24
+3  A: 

Why not avoid constructing the extra arrays?

def updateCentroid(self, label):
    # accumulate running sums instead of building intermediate lists
    sumX = 0.0; sumY = 0.0  # float initialisers avoid integer division below
    N = len(self.clusters[label].points)
    for point in self.clusters[label].points:
        sumX += point.x
        sumY += point.y
    self.clusters[label].centroid.x = sumX / N
    self.clusters[label].centroid.y = sumY / N
Michael Anderson
+5  A: 

A K-means algorithm is already implemented in scipy.cluster.vq. If there is something about that implementation that you are trying to change, then I'd suggest starting by studying the code there:

In [62]: import scipy.cluster.vq as scv
In [64]: scv.__file__
Out[64]: '/usr/lib/python2.6/dist-packages/scipy/cluster/vq.pyc'

PS. Because the algorithm you posted keeps the data behind a dict (self.clusters) and attribute lookups (.points), you are forced to use slow Python looping just to get at your data. A major speed gain could be achieved by sticking with numpy arrays. See the scipy implementation of k-means clustering for ideas on a better data structure.
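
For reference, a minimal sketch of how that scipy API can be called on a plain (N, 2) numpy array (the data below is made up for illustration):

import numpy
from scipy.cluster.vq import kmeans, vq, whiten

# all points in one (N, 2) array instead of per-cluster object lists
points = numpy.random.rand(100, 2)

# kmeans works best on unit-variance features; whiten rescales each column
whitened = whiten(points)

# compute 3 centroids, then assign each point to its nearest centroid
centroids, distortion = kmeans(whitened, 3)
labels, distances = vq(whitened, centroids)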

unutbu
Yes, my implementation is different from the standard algorithm, since it is updated online. However, I will admit that storing the points in an array instead of the existing structure might be a good idea. I will have a look at how the update step is implemented in scipy.cluster.vq.
Theodor
+1  A: 

Without extra lists:

def updateCentroid(self, label):
    points = self.clusters[label].points
    # the generator expression must be parenthesized when another argument follows
    self.clusters[label].centroid.x = numpy.fromiter((point.x for point in points), dtype=numpy.float64).mean()
    self.clusters[label].centroid.y = numpy.fromiter((point.y for point in points), dtype=numpy.float64).mean()
eumiro
+1  A: 

Perhaps the added features of numpy's mean add a bit of overhead.

>>> def myMean(itr):
...   c = t = 0
...   for item in itr:
...     c += 1
...     t += item
...   return t / c
...
>>> import timeit
>>> a = range(20)
>>> t1 = timeit.Timer("myMean(a)","from __main__ import myMean, a")
>>> t1.timeit()
6.8293311595916748
>>> t2 = timeit.Timer("average(a)","from __main__ import a; from numpy import average")
>>> t2.timeit()
69.697283029556274
>>> t3 = timeit.Timer("average(array(a))","from __main__ import a; from numpy import average, array")
>>> t3.timeit()
51.65147590637207
>>> t4 = timeit.Timer("fromiter(a,npfloat).mean()","from __main__ import a; from numpy import average, fromiter,float as npfloat")
>>> t4.timeit()
18.513712167739868

Looks like numpy's best performance came when using fromiter.

MattH
The overhead is from converting an iterator to an array... `numpy.mean` is _very_ fast when operating on an already created array. (Compare your results if you use `a = numpy.arange(20)` instead of the builtin `range`.)
Joe Kington
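
A quick way to check that claim yourself (a sketch only; timings vary by machine, so no numbers are claimed here):

import timeit

# time numpy's mean on a pre-built array, so no per-call conversion happens
t5 = timeit.Timer("a.mean()", "from numpy import arange; a = arange(20)")
print(t5.timeit())
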
A: 

That's the problem with profilers that only tell you about functions. This is the method I use; it pinpoints costly lines of code, including the points where functions are called.

That said, there's a common assumption that data structures are free. As @Michael-Anderson asked, why not avoid making the arrays? That's the first thing I noticed in your code: you're building lists by appending, and you don't need to.

Mike Dunlavey
+3  A: 

The costly part of your function is almost certainly the iteration over the points. Avoid it altogether by making self.clusters[label].points a numpy array itself, and then computing the mean directly on it. For example, if points contains the X and Y coordinates interleaved in a 1D array:

points = self.clusters[label].points
x_mean = numpy.mean(points[0::2])  # even indices hold the x coordinates
y_mean = numpy.mean(points[1::2])  # odd indices hold the y coordinates
Luper Rouch
Yes, I agree that this is the way to go. A stupid question, though: how do I append items to a numpy array? Something like points.append([x, y]) does not seem to work.
Theodor
@Theodor - numpy arrays aren't designed to be quickly appended to. (You can use `numpy.append`, but it's slow.) The best option (when you don't know the final size) is to build a list and then convert it to a numpy array at the end, as in the sketch below.
Joe Kington
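
A minimal sketch of that list-then-convert pattern (the sample coordinates are made up for illustration):

import numpy

# stand-in data; in the original code these would come from Point objects
incoming = [(0.0, 1.0), (2.0, 3.0), (4.0, 5.0)]

coords = []  # grow a plain Python list first
for x, y in incoming:
    coords.append((x, y))

points = numpy.array(coords)  # convert once, at the end
print(points.mean(axis=0))    # centroid of all points in one call
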
A: 

One way to go is to add an x_sum and y_sum to your cluster objects and sum the coordinates as points are added. If things are moving around, you can also update the sums as points move. Then getting the centroid is just a matter of dividing x_sum and y_sum by the number of points. If your points are numpy vectors that can be added, then you don't even need to sum the components separately: just maintain a sum of all the vectors and multiply by 1/len at the end, as in the sketch below.
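
A minimal sketch of that idea, assuming a Cluster class along the lines implied by self.clusters[label] (the class layout here is a guess, not the original code):

class Cluster:
    def __init__(self):
        self.points = []
        self.x_sum = 0.0  # running sums, updated on every insertion
        self.y_sum = 0.0

    def add(self, point):
        self.points.append(point)
        self.x_sum += point.x
        self.y_sum += point.y

    def centroid(self):
        # the centroid is just the running sums divided by the point count
        n = len(self.points)
        return self.x_sum / n, self.y_sum / n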

phkahler
+1  A: 

OK, I figured out a moving-average solution which is fast without changing the data structures:

def updateCentroid(self, label):
    cluster = self.clusters[label]
    n = len(cluster.points)
    # running mean: new_mean = ((n-1)*old_mean + newest_value) / n
    cluster.centroid.x = ((n-1)*cluster.centroid.x + cluster.points[n-1].x)/n
    cluster.centroid.y = ((n-1)*cluster.centroid.y + cluster.points[n-1].y)/n

This lowered the computation time (for the whole k-means algorithm) to 13% of the original. =)

Thank you all for some great insight!

Theodor