Online k-means clustering | ansaurus

+1 Q:

Online k-means clustering

Is there an online version of the k-means clustering algorithm?

By online I mean that data points are processed serially, one at a time as they enter the system, which saves computing time in real-time use.

I have written one myself with good results, but I would really prefer to have something "standardized" to refer to, since it is to be used in my master's thesis.

Also, does anyone have advice for other online clustering algorithms? (lmgtfy failed ;))

+2 A:

Yes, there is. Google failed to find it because it's more commonly known as "sequential k-means".

You can find two pseudo-code implementations of sequential k-means in this section of some Princeton CS class notes by Richard Duda. I've reproduced one of the two implementations below:

Make initial guesses for the means m1, m2, ..., mk
Set the counts n1, n2, ..., nk to zero
Until interrupted
    Acquire the next example, x
    If mi is closest to x
        Increment ni
        Replace mi by mi + (1/ni)*( x - mi)
    end_if
end_until
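
The pseudocode above could be sketched in Python roughly as follows. This is only an illustration, not code from Duda's notes: the class name `SequentialKMeans`, the method `update`, the use of Euclidean distance for "closest", and the sample points are all assumptions made for the example.

```python
import math
from typing import List, Sequence


class SequentialKMeans:
    """Sequential (online) k-means: keeps only a mean and a count per cluster."""

    def __init__(self, initial_means: Sequence[Sequence[float]]) -> None:
        # Initial guesses for the means m1..mk; the counts n1..nk start at zero.
        self.means: List[List[float]] = [list(m) for m in initial_means]
        self.counts: List[int] = [0] * len(self.means)

    def update(self, x: Sequence[float]) -> int:
        """Process one example and return the index of the cluster it joined."""
        # Find the mean closest to x (Euclidean distance is an assumption here).
        i = min(range(len(self.means)), key=lambda j: math.dist(self.means[j], x))
        self.counts[i] += 1
        # m_i <- m_i + (1/n_i) * (x - m_i)
        rate = 1.0 / self.counts[i]
        self.means[i] = [m + rate * (xi - m) for m, xi in zip(self.means[i], x)]
        return i


# Hypothetical stream of 2-D points, fed in one at a time.
model = SequentialKMeans(initial_means=[[0.0, 0.0], [10.0, 10.0]])
for point in [[1.0, 1.0], [9.0, 9.0], [3.0, 1.0], [11.0, 9.0]]:
    model.update(point)
```

After these four points the counts are [2, 2] and the means are [[2.0, 1.0], [10.0, 9.0]]. Note that the first point assigned to a cluster replaces its initial guess entirely, because 1/ni is 1 at that step.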

The beautiful thing about it is that you only need to remember the mean of each cluster and the count of the number of data points assigned to the cluster. Once you update those two variables, you can throw away the data point.
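
To see why the mean and the count are enough, here is a short derivation. The notation is introduced only for this sketch: x_1, ..., x_n are the points assigned to one cluster so far, and m_n is their mean.

```latex
\[
m_n \;=\; \frac{1}{n}\sum_{j=1}^{n} x_j
    \;=\; \frac{(n-1)\,m_{n-1} + x_n}{n}
    \;=\; m_{n-1} + \frac{1}{n}\,\bigl(x_n - m_{n-1}\bigr)
\]
```

The right-hand side is the update rule from the pseudocode: it needs only the previous mean, the new count, and the new point, so earlier points never have to be stored.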

I'm not sure where you would be able to find a citation for it. I would start looking in Duda's classic text Pattern Classification and Scene Analysis or the newer edition Pattern Classification. If it's not there, you could try Chris Bishop's newest book or Daphne Koller and Nir Friedman's recent text.

qdjm 2010-09-14 07:24:30

Thank you. That made all the difference.

Theodor 2010-09-14 08:55:54
