Clustering problem

Check out scipy.clustering for a start. Key word searches can then give a lot of info on the different algorithms that are used there. Clustering is a big field, with a lot of research and practical applications, and a number of simple approaches that have been found to work fairly well, so you may not want to start by rolling your own.

This said, clustering algorithms are generally fairly easy to program, and if you do want to program your own, k-means and agglomerative clustering are some of the favorites that are quick to do.

Finally, I'm not sure that your idea of exactly N clusters that are bounded by a certain size is self-consistent, but it depends on exactly what you mean by "size" and "cluster" (are single points a cluster?).

Update:

Following the OP's comments below, I think that the standard clustering methods won't give an optimal solution to this problem because there's not a continuous metric for the "distance" between points that can be optimized. Although they may give a good solution or approximation in some cases. For a clustering approach I'd try k-means since the premise of this method is having a fixed N.

But instead of clustering, this seems more like a covering problem (i.e., you have N rectangles of fixed size and you're trying to cover all of the points with them), but I don't know much about these, so I'll leave it to someone else.

If your number of clusters is fixed and you only want to maximize the number of points that are in these clusters then I think a greedy solution would be good :

find the rectangle that can contains the maximum number of points,
remove these points,
find the next rectangle
...

So how to find the rectangle of maximum area A (in fact each rectangle will have this area) that contains the maximum number of points ?

A rectangle is not really common for euclidean distance, before trying to solve this, could you precise if you really need rectangle or just some king of limit on the cluster size ? Would a circle/ellipse work ?

EDIT : greedy will not work (see comment below) and it really need to be rectangles...

I should have been more specific. The attributes of a certain data point are cartesian coordinates. The ' physical size' of the cluster is determined by the attributes of a data point which would allow me to calculate an area.

jlv 2010-10-08 15:21:32

it's still not clear... e.g. you calculate the size of a cluster (of many data points, I assume), but the attributes of a single data point? Usually by size people mean either 1) the number of data points, or 2) the area in, say, Cartesian space. You mean something else? Also, is there are reason why none standard clustering methods will work for you?

tom10 2010-10-08 15:43:27

I'm not familiar enough with the realm of clustering methods to know what will work for me. As for the data, each point is an (x,y) coordinate, and I use those points to define a rectangle for a cluster. So the min x, min y for in the data points for a cluster would represent the bottom-left point of a rectangle. the max x and max y in the data point would present the top - right point of a rectangle. I hope that makes sense

jlv 2010-10-08 15:59:54

@jlv : can you give a simple example with the input and wanted output ? In cartesian coordinates it's not very common to use rectangles.

Loïc Février 2010-10-08 16:25:39

Sure, that all sounds reasonable. What I don't understand then, is what if your points span a bigger area than N times the maximum bounding area? In this case, for example, there won't be a way fulfill the criteria.

tom10 2010-10-08 16:28:24

My input is a set of coordinates (x, y). My desired output is N rectangles with the highest number of coordinates in them while each rectangle is below a certain area.

jlv 2010-10-08 16:29:41

Also, consider the example of 4 pts, (0,0), (10,10), (11,11), (22,22). What boxes would you like then? It seems if you wanted N=2, and minimal largest box you'd group ((0,0), (10,10)) and ((11,11), (22,22)). Is this what you'd like?

tom10 2010-10-08 16:30:59

Tom, to your previous comment, I fully expect that N * maximum bounding area won't span the area of the points. I just want to select boxes that have as many data-points in them while not overlapping and bound by a certain area. To your most recent comment, that's exactly what I want - maximize the number of points in boxes

jlv 2010-10-08 16:44:37

@jlv: OK. The update in my answer lists about all I have to say for this then.

tom10 2010-10-08 17:01:59

Thanks tom! Half the battle is terminology, and I feel like I merely just didn't know how to properly describe my problem.

jlv 2010-10-08 17:04:49

Sure, I hope it helps, though it is outside of what I know about, so you should only take it as a suggestion and not as any kind of fact.

tom10 2010-10-08 17:14:11

Unfortunately, it seems like I do have to use rectangles. Part of the spec. Do you have any advice how to find simply one rectangle with the maximum number of points?

jlv 2010-10-08 18:58:32

I don't think a greedy solution will work. Consider, for example, a square of 4 pts, length 1, centered within a square of 4 pts, length 2, and 4 bounding regions with length 1. The first greedy boundary will take the 4 pts of the smaller square, leaving 3 boundaries for the remaining larger outer square's 4 pts. The correct solution here, is, of course, one boundary for the two points in of each square's corners.

tom10 2010-10-08 19:02:17

(And I apologize for being a critic when I'm not really offering a solution of my own.)

tom10 2010-10-08 19:03:22

Don't apologize, glad to be proven wrong. Greedy seemed to good to be true bu not counter-example came to my mind. Thanks !

Loïc Février 2010-10-08 19:12:36

@jlv : can you give some more infos ? Max coordinates of points, typical value for N, max number of points ?

Loïc Février 2010-10-08 19:15:27

Actually, maybe I was wrong here... I'm assuming that the bounding shapes are squares, but if they can be rectangles, then my example isn't a valid counter example since two (even extremely thin rectangles) could grab the other 4 pts of the outer square?

tom10 2010-10-08 19:19:06

The coordinates will be latitude and longitude, so I'm writing this to span those features. N is 5, I'm expecting the number of points to be 50,000? Man, using rectangles makes this REALLY computationally expensive. Each time I want to merge two rectangles, I have to make sure that the new rectangle doesn't intersect any existing rectangle.

jlv 2010-10-08 19:27:49

With this many points it's really worth trying k-means with N=5. As a first pass this may give a sufficient solution, and then if it doesn't, maybe just shifting a few points would solve it. It's easy to do with the scipy kmeans implementation. It's not ideal for this problem, and kmeans itself is just a heuristic and not optimal, but it's probably a good start.

tom10 2010-10-08 20:25:04

ansaurus

tags:

views:

answers:

Clustering problem

related questions