I have a list of opaque objects. I am only able to calculate the distance between them (not true, just setting the conditions for the problem):

// Opaque object: the only operation available is the pairwise distance.
abstract class Thing {
    public abstract double DistanceTo(Thing other);
}

I would like to cluster these objects. I would like to control the number of clusters and I would like for "close" objects to be in the same cluster:

List<Cluster> cluster(int numClusters, List<Thing> things);

Can anyone suggest (and link to ;-)) some clustering algorithms (the simpler, the better!) or libraries that can help me?

Clarification Most clustering algorithms require that the objects be laid out in some N-dimensional space. This space is used to find "centroids" of clusters. In my case, I do not know what N is, nor do I know how to extract a coordinate system from the objects. All I know is how far apart 2 objects are. I would like to find a good clustering algorithm that uses only that information.

Imagine that you are clustering based upon the "smell" of an object. You don't know how to lay "smells" out on a 2D plane, but you do know whether two smells are similar or not.

+2  A: 

Here's a quick algorithm.

While (points_left > 0) {
 Select a random point that is not already clustered
 Add point and all points within x distance 
   that aren't already clustered to a new cluster.
}
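In C#, using the Thing class from the question, a rough sketch of that idea might look like the following (x is a distance threshold you have to pick, and, as the comments below point out, it does not let you fix the number of clusters):

using System;
using System.Collections.Generic;
using System.Linq;

static class ThresholdSketch
{
    // Naive threshold clustering: repeatedly pick an unclustered object and
    // group it with every unclustered object within distance x of it.
    // Assumes Thing.DistanceTo from the question and that DistanceTo(self) == 0.
    public static List<List<Thing>> Cluster(List<Thing> things, double x)
    {
        var random = new Random();
        var unclustered = new List<Thing>(things);
        var clusters = new List<List<Thing>>();
        while (unclustered.Count > 0)
        {
            // Pick a random seed that is not already clustered.
            Thing seed = unclustered[random.Next(unclustered.Count)];
            // The new cluster is the seed plus everything still unclustered within x of it.
            var cluster = unclustered.Where(t => seed.DistanceTo(t) <= x).ToList();
            clusters.Add(cluster);
            unclustered.RemoveAll(cluster.Contains);
        }
        return clusters;
    }
}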

Alternatively, read the Wikipedia page. K-means clustering is a good choice:

The K-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster.

The algorithm steps are:

* Choose the number of clusters, k.
* Randomly generate k clusters and determine the cluster centers, or
  directly generate k random points as cluster centers.
* Assign each point to the nearest cluster center.
* Recompute the new cluster centers.
* Repeat the two previous steps until some convergence criterion is
  met (usually that the assignment hasn't changed).

The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments. It minimizes intra-cluster variance, but does not ensure that the result has a global minimum of variance. Another disadvantage is the requirement for the concept of a mean to be definable, which is not always the case. For such datasets the k-medoids variant is appropriate.
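Just to make those steps concrete, here is a minimal C# sketch of textbook Lloyd-style k-means over plain coordinate vectors. Note that it needs real coordinates to average into centroids, which is exactly what the question says is not available:

using System;
using System.Collections.Generic;
using System.Linq;

static class KMeansSketch
{
    // Minimal Lloyd-style k-means over coordinate vectors (double[]).
    // Returns, for each point, the index of the cluster it ends up in.
    public static int[] Cluster(double[][] points, int k, int maxIterations = 100)
    {
        var rng = new Random();
        int dims = points[0].Length;
        // Seed the centroids with k distinct random points.
        double[][] centroids = points.OrderBy(_ => rng.Next()).Take(k)
                                     .Select(p => (double[])p.Clone()).ToArray();
        var assignment = new int[points.Length];
        for (int iter = 0; iter < maxIterations; iter++)
        {
            bool changed = false;
            // Assignment step: each point goes to its nearest centroid.
            for (int i = 0; i < points.Length; i++)
            {
                int best = 0;
                double bestDist = double.MaxValue;
                for (int c = 0; c < k; c++)
                {
                    double dist = Distance(points[i], centroids[c]);
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            if (!changed && iter > 0) break;   // converged: no assignment changed
            // Update step: each centroid becomes the mean of its points.
            for (int c = 0; c < k; c++)
            {
                var members = Enumerable.Range(0, points.Length)
                                        .Where(i => assignment[i] == c).ToList();
                if (members.Count == 0) continue;   // keep an empty cluster's old centroid
                for (int d = 0; d < dims; d++)
                    centroids[c][d] = members.Average(i => points[i][d]);
            }
        }
        return assignment;
    }

    static double Distance(double[] a, double[] b) =>
        Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());
}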

sysrqb
In your algorithm, how do you calculate x? How do you guarantee that n clusters will be created?
Frank Krueger
Oh, I somehow missed that you wanted n clusters. Use the k-means clustering that I edited in, then :)
sysrqb
"k clusters and determine the cluster centers", but I am not clustering points - just weird objects that have distances. But thanks for the additional effort.
Frank Krueger
He's specifically asking for something that doesn't require a centroid. K-means requires a notion of a "mean", or a centroid. You mention k-medoids here, but only in the last line of the text pasted from the Wikipedia article.
tgamblin
+3  A: 

Here's an outline for a clustering algorithm that doesn't have the K-means requirement of finding a centroid.

  1. Determine the distance between all objects. Record the n most separated objects.
    [finds the roots of our clusters, time O(n^2)]
  2. Assign each of these n objects to its own new, distinct cluster.
  3. For every other object:
    [assign objects to clusters, time O(n^2)]
    1. For each cluster:
      1. Calculate the average distance from a cluster to that object by averaging the distance of each object in the cluster to the object.
    2. Assign the object to the closest cluster.

This algorithm will certainly cluster the objects, but its runtime is O(n^2). Plus, it is heavily guided by the first n objects chosen.
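Here is a rough C# sketch of that outline, using the Thing class from the question. "Most separated" is interpreted here as greedy farthest-point seeding starting from an arbitrary object, which is only one possible reading (see the comments below):

using System;
using System.Collections.Generic;
using System.Linq;

static class FarthestSeedSketch
{
    public static List<List<Thing>> Cluster(List<Thing> things, int numClusters)
    {
        // Steps 1-2: pick numClusters roots that are far apart, one per cluster.
        // Greedy farthest-point seeding: each new root is the object whose
        // nearest already-chosen root is farthest away.
        var roots = new List<Thing> { things[0] };
        while (roots.Count < numClusters)
        {
            Thing next = things.Except(roots)
                               .OrderByDescending(t => roots.Min(r => r.DistanceTo(t)))
                               .First();
            roots.Add(next);
        }
        var clusters = roots.Select(r => new List<Thing> { r }).ToList();

        // Step 3: every remaining object joins the cluster whose current
        // members have the smallest average distance to it.
        foreach (Thing t in things.Except(roots))
        {
            List<Thing> best = clusters
                .OrderBy(c => c.Average(member => member.DistanceTo(t)))
                .First();
            best.Add(t);
        }
        return clusters;
    }
}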

Can anyone improve upon this (better runtime perf, less dependent upon initial choices)? I would love to see your ideas.

Frank Krueger
Step 1 seems to be the worst part of this algorithm. I would like to adapt the K-means initialization strategy so that the initial clusters are chosen more logically. There must be edge cases where step 1 picks bad objects as roots.
Frank Krueger
How do you define the n most separate objects? Are they the objects with the highest average distance to all other objects?
MahlerFive
I've bounced between definitions. I've thought of taking the largest pairs, then grabbing the objects from those pairs. Your "highest average distance" sounds good to me too, but I feel that they both have flaws. "Highest average distance" seems best in my mind, but requires more computation.
Frank Krueger
For the average, you just need to calculate the total, since you know the number of objects. Why not calculate/store the total as your object distances are being assigned? Then taking the average is merely a division per object, and then one pass through the list of objects can find the n largest.
MahlerFive
Agreed. But what if I have the scenario of two objects close to each other, then 100 objects evenly distributed far away from those. Say I want 4 clusters. Won't the "highest average" pick those two points and create 2 clusters right next to each other since both have very high average distances?
Frank Krueger
I think you want K-medoids. There are sampled variants that run faster than O(n^2) and are suitable for large numbers of objects. It also minimizes dissimilarity like K-means does. It's not clear to me that this algorithm does that -- seems like it depends strongly on your initial configuration.
tgamblin
A: 

How about this approach:

  1. Assign all objects to one cluster.
  2. Find the two objects, a and b, that are within the same cluster, k, and that are maximum distance apart. To clarify, there should be one a and b for the whole set, not one a and b for each cluster.
  3. Split cluster k into two clusters, k1 and k2, one with object a and one with object b.
  4. For all other objects in cluster k, add them to either k1 or k2 by determining the minimum average distance to all other objects in that cluster.
  5. Repeat steps 2-4 until N clusters are formed.

I think this algorithm should give you a fairly good clustering, although the efficiency might be pretty bad. To improve the efficiency you could alter step 4 so that you find the minimum distance to only the original object that started the cluster, rather than the average distance to all objects already in the cluster.
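For what it's worth, a rough C# sketch of this splitting approach (using the Thing class from the question, with no attempt at efficiency) might look like this:

using System;
using System.Collections.Generic;
using System.Linq;

static class DivisiveSketch
{
    public static List<List<Thing>> Cluster(List<Thing> things, int numClusters)
    {
        // Step 1: everything starts in a single cluster.
        var clusters = new List<List<Thing>> { new List<Thing>(things) };
        while (clusters.Count < numClusters)
        {
            // Step 2: find the single farthest pair (a, b) across all clusters.
            List<Thing> toSplit = null;
            Thing a = null, b = null;
            double maxDist = -1;
            foreach (var cluster in clusters)
                for (int i = 0; i < cluster.Count; i++)
                    for (int j = i + 1; j < cluster.Count; j++)
                        if (cluster[i].DistanceTo(cluster[j]) > maxDist)
                        {
                            maxDist = cluster[i].DistanceTo(cluster[j]);
                            toSplit = cluster;
                            a = cluster[i];
                            b = cluster[j];
                        }
            if (toSplit == null) break;   // nothing left to split

            // Step 3: replace that cluster with two new ones seeded by a and b.
            var k1 = new List<Thing> { a };
            var k2 = new List<Thing> { b };
            clusters.Remove(toSplit);
            clusters.Add(k1);
            clusters.Add(k2);

            // Step 4: distribute the remaining members by smaller average distance.
            foreach (Thing t in toSplit)
            {
                if (ReferenceEquals(t, a) || ReferenceEquals(t, b)) continue;
                double d1 = k1.Average(m => m.DistanceTo(t));
                double d2 = k2.Average(m => m.DistanceTo(t));
                (d1 <= d2 ? k1 : k2).Add(t);
            }
        }
        return clusters;
    }
}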

MahlerFive
+1  A: 

Phylogenetic DNA sequence analysis regularly uses hierarchical clustering on text strings, with [alignment] distance matrices. Here's a nice R tutorial for clustering:

(Shortcut: Go straight to the "Hierarchical Agglomerative" section...)

Here are some other [language] libraries:

This approach could help determine how many [k] "natural" clusters there are and which objects to use as roots for the k-means approaches above.
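If you'd rather stay in C#, the agglomerative idea itself only needs pairwise distances. Here is a minimal average-linkage sketch, assuming the Thing class from the question (a real library implementation such as R's hclust is far more efficient):

using System;
using System.Collections.Generic;
using System.Linq;

static class AgglomerativeSketch
{
    // Minimal average-linkage agglomerative clustering using only pairwise
    // distances. Roughly O(n^3); fine for small inputs only.
    public static List<List<Thing>> Cluster(List<Thing> things, int numClusters)
    {
        // Start with every object in its own cluster.
        var clusters = things.Select(t => new List<Thing> { t }).ToList();
        while (clusters.Count > numClusters)
        {
            // Find the pair of clusters with the smallest average inter-cluster distance.
            int bestI = 0, bestJ = 1;
            double bestDist = double.MaxValue;
            for (int i = 0; i < clusters.Count; i++)
                for (int j = i + 1; j < clusters.Count; j++)
                {
                    double d = clusters[i]
                        .SelectMany(a => clusters[j], (a, b) => a.DistanceTo(b))
                        .Average();
                    if (d < bestDist) { bestDist = d; bestI = i; bestJ = j; }
                }
            // Merge the closest pair and keep going until numClusters remain.
            clusters[bestI].AddRange(clusters[bestJ]);
            clusters.RemoveAt(bestJ);
        }
        return clusters;
    }
}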

bubaker
+3  A: 

I think you are looking for K-Medoids. It's like K-means in that you specify the number of clusters, K, in advance, but it doesn't require that you have a notion of "averaging" the objects you're clustering like K-means does.

Instead, every cluster has a representative medoid, which is the member of the cluster closest to the middle. You could think of it as a version of K-means that finds "medians" instead of "means". All you need is a distance metric to cluster things, and I've used this in some of my own work for exactly the same reasons you cite.

Naive K-medoids is not the fastest algorithm, but there are fast variants that are probably good enough for your purposes. Here are descriptions of the algorithms and links to the documentation for their implementations in R:

  1. PAM is the basic O(n^2) implementation of K-medoids.
  2. CLARA is a much faster, sampled version of PAM. It works by clustering a randomly sampled subset of the objects with PAM and then grouping the entire set of objects based on that subset. You should still be able to get very good clusterings fast with this.

If you need more information, here's a paper that gives an overview of these and other K-medoids methods.
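If you want to prototype the idea directly against your Thing class before reaching for R, here is a rough C# sketch of the simple "alternate assign / recompute medoid" variant (sometimes called Voronoi iteration). It is not full PAM with swap moves, but it only needs DistanceTo:

using System;
using System.Collections.Generic;
using System.Linq;

static class KMedoidsSketch
{
    public static List<List<Thing>> Cluster(List<Thing> things, int k, int maxIterations = 100)
    {
        var rng = new Random();
        // Seed with k distinct random medoids.
        var medoids = things.OrderBy(_ => rng.Next()).Take(k).ToList();
        List<List<Thing>> clusters = null;
        for (int iter = 0; iter < maxIterations; iter++)
        {
            // Assignment step: each object joins the cluster of its nearest medoid.
            clusters = medoids.Select(_ => new List<Thing>()).ToList();
            foreach (Thing t in things)
            {
                int nearest = 0;
                for (int m = 1; m < k; m++)
                    if (t.DistanceTo(medoids[m]) < t.DistanceTo(medoids[nearest]))
                        nearest = m;
                clusters[nearest].Add(t);
            }
            // Update step: the new medoid is the member with the smallest
            // total distance to everything else in its cluster.
            var newMedoids = clusters
                .Select((c, i) => c.Count == 0
                    ? medoids[i]   // keep an empty cluster's old medoid
                    : c.OrderBy(cand => c.Sum(o => cand.DistanceTo(o))).First())
                .ToList();
            if (newMedoids.SequenceEqual(medoids)) break;   // converged
            medoids = newMedoids;
        }
        return clusters;
    }
}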

tgamblin
Excellent information, thank you very much.
Frank Krueger