views:

157

answers:

3
+2  Q: 

Clustering problem

Hi all

I've been tasked to find N clusters containing the most points for a certain data set given that the clusters are bounded by a certain size. Currently, I am attempting to do this by plugging in my data into a kd-tree, iterating over the data and finding its nearest neighbor, and then merging the points if the cluster they make does not exceed a limit. I'm not sure this approach will give me a global solution so I'm looking for ways to tweak it. If you can tell me what type of problem this would go under, that'd be great too.

+3  A: 

Check out scipy.clustering for a start. Key word searches can then give a lot of info on the different algorithms that are used there. Clustering is a big field, with a lot of research and practical applications, and a number of simple approaches that have been found to work fairly well, so you may not want to start by rolling your own.

This said, clustering algorithms are generally fairly easy to program, and if you do want to program your own, k-means and agglomerative clustering are some of the favorites that are quick to do.

Finally, I'm not sure that your idea of exactly N clusters that are bounded by a certain size is self-consistent, but it depends on exactly what you mean by "size" and "cluster" (are single points a cluster?).

Update:

Following the OP's comments below, I think that the standard clustering methods won't give an optimal solution to this problem because there's not a continuous metric for the "distance" between points that can be optimized. Although they may give a good solution or approximation in some cases. For a clustering approach I'd try k-means since the premise of this method is having a fixed N.

But instead of clustering, this seems more like a covering problem (i.e., you have N rectangles of fixed size and you're trying to cover all of the points with them), but I don't know much about these, so I'll leave it to someone else.

tom10
I should have been more specific. The attributes of a certain data point are cartesian coordinates. The ' physical size' of the cluster is determined by the attributes of a data point which would allow me to calculate an area.
jlv
it's still not clear... e.g. you calculate the size of a cluster (of many data points, I assume), but the attributes of a single data point? Usually by size people mean either 1) the number of data points, or 2) the area in, say, Cartesian space. You mean something else? Also, is there are reason why none standard clustering methods will work for you?
tom10
I'm not familiar enough with the realm of clustering methods to know what will work for me. As for the data, each point is an (x,y) coordinate, and I use those points to define a rectangle for a cluster. So the min x, min y for in the data points for a cluster would represent the bottom-left point of a rectangle. the max x and max y in the data point would present the top - right point of a rectangle. I hope that makes sense
jlv
@jlv : can you give a simple example with the input and wanted output ? In cartesian coordinates it's not very common to use rectangles.
Loïc Février
Sure, that all sounds reasonable. What I don't understand then, is what if your points span a bigger area than N times the maximum bounding area? In this case, for example, there won't be a way fulfill the criteria.
tom10
My input is a set of coordinates (x, y). My desired output is N rectangles with the highest number of coordinates in them while each rectangle is below a certain area.
jlv
Also, consider the example of 4 pts, (0,0), (10,10), (11,11), (22,22). What boxes would you like then? It seems if you wanted N=2, and minimal largest box you'd group ((0,0), (10,10)) and ((11,11), (22,22)). Is this what you'd like?
tom10
Tom, to your previous comment, I fully expect that N * maximum bounding area won't span the area of the points. I just want to select boxes that have as many data-points in them while not overlapping and bound by a certain area. To your most recent comment, that's exactly what I want - maximize the number of points in boxes
jlv
@jlv: OK. The update in my answer lists about all I have to say for this then.
tom10
Thanks tom! Half the battle is terminology, and I feel like I merely just didn't know how to properly describe my problem.
jlv
Sure, I hope it helps, though it is outside of what I know about, so you should only take it as a suggestion and not as any kind of fact.
tom10
A: 

If your number of clusters is fixed and you only want to maximize the number of points that are in these clusters then I think a greedy solution would be good :

  • find the rectangle that can contains the maximum number of points,
  • remove these points,
  • find the next rectangle
  • ...

So how to find the rectangle of maximum area A (in fact each rectangle will have this area) that contains the maximum number of points ?

A rectangle is not really common for euclidean distance, before trying to solve this, could you precise if you really need rectangle or just some king of limit on the cluster size ? Would a circle/ellipse work ?

EDIT : greedy will not work (see comment below) and it really need to be rectangles...

Loïc Février
Unfortunately, it seems like I do have to use rectangles. Part of the spec. Do you have any advice how to find simply one rectangle with the maximum number of points?
jlv
I don't think a greedy solution will work. Consider, for example, a square of 4 pts, length 1, centered within a square of 4 pts, length 2, and 4 bounding regions with length 1. The first greedy boundary will take the 4 pts of the smaller square, leaving 3 boundaries for the remaining larger outer square's 4 pts. The correct solution here, is, of course, one boundary for the two points in of each square's corners.
tom10
(And I apologize for being a critic when I'm not really offering a solution of my own.)
tom10
Don't apologize, glad to be proven wrong. Greedy seemed to good to be true bu not counter-example came to my mind. Thanks !
Loïc Février
@jlv : can you give some more infos ? Max coordinates of points, typical value for N, max number of points ?
Loïc Février
Actually, maybe I was wrong here... I'm assuming that the bounding shapes are squares, but if they can be rectangles, then my example isn't a valid counter example since two (even extremely thin rectangles) could grab the other 4 pts of the outer square?
tom10
The coordinates will be latitude and longitude, so I'm writing this to span those features. N is 5, I'm expecting the number of points to be 50,000? Man, using rectangles makes this REALLY computationally expensive. Each time I want to merge two rectangles, I have to make sure that the new rectangle doesn't intersect any existing rectangle.
jlv
With this many points it's really worth trying k-means with N=5. As a first pass this may give a sufficient solution, and then if it doesn't, maybe just shifting a few points would solve it. It's easy to do with the scipy kmeans implementation. It's not ideal for this problem, and kmeans itself is just a heuristic and not optimal, but it's probably a good start.
tom10
A: 

link textActually, I think this is really pretty easy with two key assumptions.

1) Assume the by "a certain size" we can say "any cluster must be contained completely within a circle with radius, r".

2) All your points are candidate "seed" points at the center of the cluster.

First calculate all the distances less than r among all points. Now solve a set covering problem using only the feasible edges that are less than r. If any point has a nearest neighbor greater than r distance away, it forms its own cluster.

Grembo