Given: A set of N points in the 2D plane (x and y coordinates), and a set of N radii, one per point. We will refer to a point's disc as the disc centered at the point with its radius.

Problem: Cluster the points. A cluster is a set of points such that each point either falls within the disc of at least one other point in the cluster, or at least one other point in the cluster falls within its disc. Clusters of single points do not count as clusters.

I need an algorithm to compute this efficiently (preferably without resorting to complicated spatial hashing techniques like kd-trees). I can settle for O(N^2) run time but definitely no more than O(N^3). I feel like this should be simple but I'm getting caught up on the non-reciprocal nature of my clustering condition.

In essence, I am looking to fill in the C function prototype:

int cluster_points(
    size_t n,
    point_t *pt, // length n list of points
    double *radius, // length n list of radii for each point
    int *cluster, // length n list of cluster indexes; -1 if not clustered
    size_t *ncluster // number of clusters (number of distinct non -1 entries in cluster)
);

This is not homework. I need this as part of a matrix algorithm for clustering complex numbers.

+2  A: 

k-means clustering based on a combination of local search and Lloyd's algorithm

http://www.cs.umd.edu/~mount/Projects/KMeans/
(Program is distributed under conditions of the GNU General Public License.)

k-means, k-medians, k-medoids, treecluster, self-organizing maps, clustercentroids, clusterdistance http://bonsai.hgc.jp/~mdehoon/software/cluster/cluster.pdf

Quandary
k-means is not an option. I want the precise inclusion metric that I described above.
Victor Liu
+1  A: 

Clustering is an NP-hard problem even if you are given the number of clusters a priori, so you can probably give up on getting a polynomial run time. There are many, many techniques to do this, and the literature is mainly found in the machine learning community; k-means is probably the easiest algorithm to understand and implement.

anonymous_21321
I don't think I mean clustering then. You can imagine that if all the radii are the same, you can loop over all pairs and determine whether each pair should be in a single cluster. This defines the adjacency matrix of the graph of cluster relationships. After that you only need to find the connected components of the graph (I'm not looking for cliques). I believe all of this is polynomial time.
Victor Liu
Ok, this helps a bit but I'm still not 100% clear on what you are looking for. I am primarily wondering how you want to limit the size or number of clusters, i.e. what is stopping me from always declaring that all points are in the same cluster (assuming it is fully connected).
anonymous_21321
Ok, picture this geometrically. I have a bunch of points in the plane, and each point has a designated radius, so really I'm placing a bunch of discs in the plane. I want to find all the groups of discs that are touching; the clusters are points for which their discs overlap to form a larger non-circular region in the plane (possibly with holes). There is no limit to the size or number of clusters; if the radii are all zero, then there are no clusters. If the radii are all huge, then there is only one cluster.
Victor Liu
Oh I see, this is a range search. There is clearly an O(n^2) algorithm by simply iterating over each pair of points. If you want to be faster you have to use more complicated data structures (kd-trees).
anonymous_21321
+1  A: 

It sounds like the obvious O(n^2) algorithm would be to create a graph with the points as vertices, and then connect two points if the conditions you give are met. And then you read off the connected components of the graph, discarding singletons. Also, the condition you gave for clustering sounds symmetric to me. Am I missing something?
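As a sketch of that approach (assuming point_t is a struct with double x and y fields, which the question doesn't show): instead of materializing the graph, a union-find over all pairs gives the connected components directly, and a relabeling pass discards singletons. This fills in the question's prototype, but treat it as an illustration rather than a drop-in implementation.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct { double x, y; } point_t; /* assumed layout */

/* find with path halving */
static size_t uf_find(size_t *parent, size_t i) {
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    return i;
}

int cluster_points(size_t n, point_t *pt, double *radius,
                   int *cluster, size_t *ncluster)
{
    size_t i, j;
    size_t *parent = malloc(n * sizeof *parent);
    int *label = malloc(n * sizeof *label);
    if (!parent || !label) { free(parent); free(label); return -1; }

    for (i = 0; i < n; ++i) parent[i] = i;

    /* O(n^2): link i and j if either point lies in the other's disc,
       i.e. |p_i - p_j| <= max(r_i, r_j); compare squared distances */
    for (i = 0; i < n; ++i) {
        for (j = i + 1; j < n; ++j) {
            double dx = pt[i].x - pt[j].x, dy = pt[i].y - pt[j].y;
            double r = radius[i] > radius[j] ? radius[i] : radius[j];
            if (dx*dx + dy*dy <= r*r)
                parent[uf_find(parent, i)] = uf_find(parent, j);
        }
    }

    /* count members per root so singletons can be discarded */
    for (i = 0; i < n; ++i) label[i] = 0;
    for (i = 0; i < n; ++i) label[uf_find(parent, i)]++;

    /* give each root with >= 2 members a consecutive cluster index */
    *ncluster = 0;
    for (i = 0; i < n; ++i)
        if (parent[i] == i)
            label[i] = (label[i] >= 2) ? (int)(*ncluster)++ : -1;
    for (i = 0; i < n; ++i)
        cluster[i] = label[uf_find(parent, i)];

    free(parent); free(label);
    return 0;
}
```

The union-find keeps the pair loop at O(n^2) time with only O(n) extra space, which is as close to the O(1)-space goal as the connected-components view allows.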

brainjam
I was hoping to do this in O(1) space, but doing the obvious thing might just be more straightforward. The cluster condition is symmetric, but for a double loop implementation, it's not.
Victor Liu
+2  A: 

The brute force solution is only O(N^2), so it should work for you.

1) Start with all of the points in the unassigned group.
2) Pick one point and look at all the others in the unassigned group to see whether they meet the radii criterion you describe.
2a) If yes, start a group, and continue on with all the other points to see if they fit in this group (based on your current point), moving them from the unassigned group into the new one as they match.
2ab) When done with the current point, go to each point added to the group and check all the points in the unassigned group.
2b) But if no point matches the radii criterion for the current point, discard it.
3) At the end you will have grouped the points by your criteria, and will have done no more than N*(N/2) inspections.
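Those steps can be sketched in C roughly as follows (the point_t layout and the grow_groups/touches names are assumptions for illustration, not part of the question's prototype):

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct { double x, y; } point_t; /* assumed layout */

/* true if either point lies in the other's disc:
   |p-q| <= max(rp,rq), compared without sqrt */
static int touches(const point_t *p, const point_t *q,
                   double rp, double rq)
{
    double dx = p->x - q->x, dy = p->y - q->y;
    double r = rp > rq ? rp : rq;
    return dx*dx + dy*dy <= r*r;
}

/* Seed a group with one unassigned point, then repeatedly sweep the
   remaining unassigned points against every member added so far
   (step 2ab above). Writes a group index into group[i], or -1 for
   singletons; returns the number of groups. */
size_t grow_groups(size_t n, const point_t *pt, const double *radius,
                   int *group)
{
    size_t *queue = malloc(n * sizeof *queue);
    size_t ngroups = 0, i, j;
    if (!queue) return 0;
    for (i = 0; i < n; ++i) group[i] = -1;

    for (i = 0; i < n; ++i) {
        size_t head = 0, tail = 0;
        if (group[i] != -1) continue;   /* already assigned */
        queue[tail++] = i;              /* tentative seed */
        group[i] = (int)ngroups;
        while (head < tail) {
            size_t cur = queue[head++];
            for (j = 0; j < n; ++j) {
                if (group[j] == -1 &&
                    touches(&pt[cur], &pt[j], radius[cur], radius[j])) {
                    group[j] = (int)ngroups;
                    queue[tail++] = j;
                }
            }
        }
        if (tail < 2) group[i] = -1;    /* discard singleton seed */
        else ++ngroups;
    }
    free(queue);
    return ngroups;
}
```

Each point is dequeued at most once and each dequeue does one O(N) sweep, so the whole thing stays O(N^2).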

Btw, what you describe is not what's normally meant by "clustering", so I think that's throwing people off here. What makes clustering a difficult problem is that the question of whether two neighboring points will be assigned to the same cluster is determined by all the other points in the data set. In your case, it's (basically) only determined by properties of the two points, so that in your case, you can just check them all.

tom10
In step 2ab, don't you have to keep going to all the added points and looking through the rest of the unassigned points until nothing new gets assigned?
Victor Liu
@VictorLiu: I don't think so. With this method, once you've established group 1 (and moved onto group 2), group 1 contains all of the points that fit the criteria and will ever be in this group. When defining group 1, every time you add a point to it you sweep through all of the other points to see if they might now be in the group. There's no chance that, say, you'd need to join group 1 and group 2, once group 1 has been established.
tom10
+1  A: 

You have a collection U of pairs (p,R) where p is a point and R its radius.

The relation ~ on U : (p,R) ~ (q,S) <=> p lies in q's disc or q lies in p's disc <=> |p-q| <= max(R,S)
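That test needs no square root if you compare squared distances; a minimal C sketch of the predicate (the point_t layout is an assumption):

```c
typedef struct { double x, y; } point_t; /* assumed layout */

/* (p,R) ~ (q,S)  <=>  |p-q| <= max(R,S), tested without sqrt */
int related(point_t p, double R, point_t q, double S)
{
    double dx = p.x - q.x, dy = p.y - q.y;
    double m = R > S ? R : S;
    return dx*dx + dy*dy <= m*m;
}
```

Since max(R,S) is symmetric in its arguments, the symmetry of ~ is immediate from this form.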

is clearly reflexive and symmetric, and so its transitive closure (call it ≈) is an equivalence relation. The equivalence classes under ≈ will be (singletons or) clusters.

I believe there are standard algorithms to compute the equivalence classes of the transitive closure of a relation like ~ above. For example, this is discussed in Numerical Recipes in the chapter on sorting, and they say that their routine is based on Knuth.

(Sorry not to provide a link but a brief search didn't come up with exactly the right thing).

dmuir