I have a function that updates the centroid (mean) in a K-means algorithm. I ran a profiler and noticed that this function uses a lot of computing time.

It looks like:

def updateCentroid(self, label):
    # gather the coordinates of the cluster's points, then average them
    X = []; Y = []
    for point in self.clusters[label].points:
        X.append(point.x)
        Y.append(point.y)
    self.clusters[label].centroid.x = numpy.mean(X)
    self.clusters[label].centroid.y = numpy.mean(Y)

So I wonder: is there a more efficient way to calculate the mean of these points? If not, is there a more elegant way to formulate it? ;)

EDIT:

Thanks for all the great responses! I was thinking that perhaps I can calculate the mean cumulatively, using something like:

    x_bar(t) = ((t-1) * x_bar(t-1) + x(t)) / t

where x_bar(t) is the new mean and x_bar(t-1) is the old mean.

Which would result in a function similar to this:

def updateCentroid(self, label):
    cluster = self.clusters[label]
    n = len(cluster.points)
    cluster.centroid.x *= (n-1) / n
    cluster.centroid.x += cluster.points[n-1].x / n
    cluster.centroid.y *= (n-1) / n
    cluster.centroid.y += cluster.points[n-1].y / n

It's not really working, but do you think it could work with some tweaking?

A: 

Try this:

def updateCentroid(self, label):
    points = self.clusters[label].points
    self.clusters[label].centroid.x = numpy.array([point.x for point in points]).mean()
    self.clusters[label].centroid.y = numpy.array([point.y for point in points]).mean()
hugo24
+3  A: 

Why not avoid constructing the extra arrays?

def updateCentroid(self, label):
    # accumulate running sums instead of building intermediate lists
    sumX = 0.0; sumY = 0.0  # float initialisers avoid integer division below
    N = len(self.clusters[label].points)
    for point in self.clusters[label].points:
        sumX += point.x
        sumY += point.y
    self.clusters[label].centroid.x = sumX / N
    self.clusters[label].centroid.y = sumY / N
Michael Anderson
+5  A: 

A K-means algorithm is already implemented in scipy.cluster.vq. If there is something about that implementation that you are trying to change, then I'd suggest starting by studying the code there:

In [62]: import scipy.cluster.vq as scv
In [64]: scv.__file__
Out[64]: '/usr/lib/python2.6/dist-packages/scipy/cluster/vq.pyc'

PS. Because the algorithm you posted keeps the data behind a dict (self.clusters) and attribute lookups (.points), you are forced to use slow Python looping just to get at your data. A major speed gain could be achieved by sticking with numpy arrays. See the scipy implementation of k-means clustering for ideas on a better data structure.
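
For reference, a minimal sketch of how that scipy API can be called on a plain (N, 2) numpy array (the data below is made up for illustration):

import numpy
from scipy.cluster.vq import kmeans, vq, whiten

# all points in one (N, 2) array instead of per-cluster object lists
points = numpy.random.rand(100, 2)

# kmeans works best on unit-variance features; whiten rescales each column
whitened = whiten(points)

# compute 3 centroids, then assign each point to its nearest centroid
centroids, distortion = kmeans(whitened, 3)
labels, distances = vq(whitened, centroids)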

unutbu
Yes, my implementation is different from the standard algorithm, since it is updated online. However, I will admit that storing the points in an array instead of the existing structure might be a good idea. I will have a look at how the update step is implemented in scipy.cluster.vq.
Theodor
+1  A: 

Without extra lists:

def updateCentroid(self, label):
    points = self.clusters[label].points
    # the generator expression must be parenthesized when another argument follows
    self.clusters[label].centroid.x = numpy.fromiter((point.x for point in points), dtype=numpy.float64).mean()
    self.clusters[label].centroid.y = numpy.fromiter((point.y for point in points), dtype=numpy.float64).mean()
eumiro
+1  A: 

Perhaps the added features of numpy's mean add a bit of overhead.

>>> def myMean(itr):
...   c = t = 0
...   for item in itr:
...     c += 1
...     t += item
...   return t / c
...
>>> import timeit
>>> a = range(20)
>>> t1 = timeit.Timer("myMean(a)","from __main__ import myMean, a")
>>> t1.timeit()
6.8293311595916748
>>> t2 = timeit.Timer("average(a)","from __main__ import a; from numpy import average")
>>> t2.timeit()
69.697283029556274
>>> t3 = timeit.Timer("average(array(a))","from __main__ import a; from numpy import average, array")
>>> t3.timeit()
51.65147590637207
>>> t4 = timeit.Timer("fromiter(a,npfloat).mean()","from __main__ import a; from numpy import average, fromiter,float as npfloat")
>>> t4.timeit()
18.513712167739868

Looks like numpy's best performance came when using fromiter.

MattH
The overhead is from converting an iterator to an array... `numpy.mean` is _very_ fast when operating on an already created array. (Compare your results if you use `a = numpy.arange(20)` instead of the builtin `range`.)
Joe Kington
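
A quick way to check that claim yourself (a sketch only; timings vary by machine, so no numbers are claimed here):

import timeit

# time numpy's mean on a pre-built array, so no per-call conversion happens
t5 = timeit.Timer("a.mean()", "from numpy import arange; a = arange(20)")
print(t5.timeit())
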
A: 

That's the problem with profilers that only tell you about functions. This is the method I use; it pinpoints costly lines of code, including the points where functions are called.

That said, there's a common assumption that data structures are free. As @Michael-Anderson asked, why not avoid making the arrays? That's the first thing I noticed in your code: you're building lists by appending, and you don't need to.

Mike Dunlavey
+3  A: 

The costly part of your function is almost certainly the iteration over the points. Avoid it altogether by making self.clusters[label].points a numpy array itself, and then computing the mean directly on it. For example, if points contains the X and Y coordinates interleaved in a 1D array:

points = self.clusters[label].points
x_mean = numpy.mean(points[0::2])  # even indices hold the x coordinates
y_mean = numpy.mean(points[1::2])  # odd indices hold the y coordinates
Luper Rouch
Yes, I agree that this is the way to go. A stupid question, though: how do I append items to a numpy array? Something like points.append([x, y]) does not seem to work.
Theodor
@Theodor - numpy arrays aren't designed to be quickly appended to. (You can use `numpy.append`, but it's slow.) The best option (when you don't know the final size) is to build a list and then convert it to a numpy array at the end, as in the sketch below.
Joe Kington
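
A minimal sketch of that list-then-convert pattern (the sample coordinates are made up for illustration):

import numpy

# stand-in data; in the original code these would come from Point objects
incoming = [(0.0, 1.0), (2.0, 3.0), (4.0, 5.0)]

coords = []  # grow a plain Python list first
for x, y in incoming:
    coords.append((x, y))

points = numpy.array(coords)  # convert once, at the end
print(points.mean(axis=0))    # centroid of all points in one call
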
A: 

One way to go is to add an x_sum and y_sum to your cluster objects and sum the coordinates as points are added. If things are moving around, you can also update the sums as points move. Then getting the centroid is just a matter of dividing x_sum and y_sum by the number of points. If your points are numpy vectors that can be added, then you don't even need to sum the components separately: just maintain a sum of all the vectors and multiply by 1/len at the end, as in the sketch below.
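
A minimal sketch of that idea, assuming a Cluster class along the lines implied by self.clusters[label] (the class layout here is a guess, not the original code):

class Cluster:
    def __init__(self):
        self.points = []
        self.x_sum = 0.0  # running sums, updated on every insertion
        self.y_sum = 0.0

    def add(self, point):
        self.points.append(point)
        self.x_sum += point.x
        self.y_sum += point.y

    def centroid(self):
        # the centroid is just the running sums divided by the point count
        n = len(self.points)
        return self.x_sum / n, self.y_sum / n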

phkahler
+1  A: 

OK, I figured out a moving-average solution which is fast without changing the data structures:

def updateCentroid(self, label):
    cluster = self.clusters[label]
    n = len(cluster.points)
    # running mean: new_mean = ((n-1)*old_mean + newest_value) / n
    cluster.centroid.x = ((n-1)*cluster.centroid.x + cluster.points[n-1].x)/n
    cluster.centroid.y = ((n-1)*cluster.centroid.y + cluster.points[n-1].y)/n

This lowered the computation time (for the whole k-means algorithm) to 13% of the original. =)

Thank you all for some great insight!

Theodor