views:

2143

answers:

17

I have a 2D area with "dots" distributed on this area. I now am trying to detect "clusters" of dots, that is, areas with a certain high density of dots.

Any thoughts on (or links to articles with thoughts on) how to elegantly detect these areas?

+8  A: 

You could try creating a Quadtree representation of the data. The shorter paths in the graph would correspond to high density areas.

Or, to put it more clearly: given a Quadtree and level-order traversal, each lower-level node composed of "dots" would represent a high density area. As the level of the nodes increases, such nodes represent lower density areas of "dots"

mepcotterell
I like this. I can think of some nifty algorithms for determining if other quad cells at the current level belong to the current "cluster", but unfortunately this comment field is too small...
Paul Tomblin
This is a *really* good idea so long as the clusters are convex.
Rich
Any links to Quadtree implementations? How fast would this be?
Mike Caron
A: 

I would calculate the distances from each point to all other points. Then sort the distances. Points that have a distance from each other that is below a threshold are considered Near. A group of points that is near to each other is a cluster.

The problem is that cluster may be clear to a human when he sees the graph, but does not have a clear mathematical definition. You need to define your near threshold, probably bu adjustuing it empirically until the result of your algorithm is (more or less) equal to what you perceive as being clustered.

Treb
That's n * (n + log(n))!
P Daddy
@P Daddy, if a faster algorithm is used but it doesn't return the right answers, its not what you want to use. Some problems are just NP complete. That's just how it is.
Kent Fredric
This is not one of them.
P Daddy
Shane MacLaughlin
That was kind of my point.
P Daddy
Fair enough. I was suggesting the idea was fine, it was simply the implementation was weak, and could be improved. btw is the last character on your opening remark meant to mean factorial or is it a sign of exasperation ;)
Shane MacLaughlin
lol... The latter. I didn't even notice that.
P Daddy
Hey guys, thanks for your input! Will look up the 'nearest neighbours' reference.
Treb
+3  A: 

How about a morphology approach?

Dilate the thresholded image by some number (depending on the target density of dots) then dots in a cluster will appear as a single object.

OpenCV supports morphological operations (as would a range of image processing libraries):

http://www.seas.upenn.edu/~bensapp/opencvdocs/ref/opencvref_cv.htm#cv_imgproc_morphology

Ian Hopkinson
+15  A: 
krusty.ar
Yikes, for a square of n x n points this is o(n^4), or for lower resolutions r, o(n^2*((n/r)^2)). Ok as a brute force method for small images maybe, but not good as a general solution.
Shane MacLaughlin
Of course I would't use this for anyting serious, just making a point, if the approach is useful, it can be optimized in many ways.
krusty.ar
Nice resulting product!
Kent Fredric
@kent: the interesting part would be to group dots accordint to the areas, this just makes more easy to see that there really are areas.@smacl: your comment made me feel bad so I updated it to be a little less ugly.
krusty.ar
@kent: I agree. Way cool image left over, but the process takes a while.
Mike Caron
+1  A: 

"Areas with a certain high density" implies that you know approximately how many dots per unit area you consider high. This leads me towards a grid approach, where you split your total area up into sub-areas of the appropriate size, then count the number of dots in each area. Once you find areas of your grid near your threshhold you can search neighboring areas of the grid too.

Bill the Lizard
The advantage of the quadtree approach over this approach is that the quadtree doesn't have to define the unit area ahead of time, just the number of dots that you'd consider "a cluster".
Paul Tomblin
I like the Quadtree approach, I upvoted it, but the unit area is part of the problem. A cluster isn't just a number of dots, it's a number of dots close to each other, that is, within a certain area.
Bill the Lizard
+12  A: 
Kent Fredric
See my answer below on genetic algorithms. In this case, if you knew ahead of time that toroidal clusters (or any unusual shape) were possible, you could simply build that possibility into your solution-generating mechanism and your fitness function.
MusiGenesis
@Kent, if you use a TIN based solution, you can group triangles by order of magnitude of longest edge length to solve this. Thus while we see a torus, there could be many more other interesting shapes in your super dense area worthy of their own analysis. Google multi-resolution TIN or Voronoi.
Shane MacLaughlin
@smacl, pretty cool idea.
Mike Caron
+11  A: 

Apply a blur filter to a copy of your 2D area. Something like

1 2 3 2 1
2 4 6 4 2
3 6 9 6 3
2 4 6 4 2
1 2 3 2 1

The "darker" areas now identify clusters of dots.

P Daddy
Interesting. Wondering whether this approach be extended into more than 2 dimensions...
Drew Noakes
Sure, but your matrix also grows exponentially. e.g. a 3D matrix would use 125 elements instead of 25. To achieve a similar effect as the above (let's call that M) would on 2D, you'd use M where z=0 and z=4, and M * 2 where z=1 and z=3, and M * 3 where z=2. Similarly with higher dimensions.
P Daddy
A: 

You could overlay a logical grid over your plane. If a grid has a certain number of contained dots, it is considered "dense" and could then be thinned. This is done a lot in GIS applications when working with cluster tolerances. Using the grid helps compartmentalize your thinning algorithm.

j0rd4n
+2  A: 

This really sounds like an academic question.

The solution that comes to mind involves r* trees. This divides your total area in individually sized and possibly overlapping boxes. After doing this you can then determine if each box represents a 'cluster' or not for you by calculating the mean distance.

R* Trees

If that approach becomes difficult to implement you may be better off splitting your datagrid into equal-sized subdivisions and determining if a cluster occurs in each; you will have to be very mindful of edge conditions with this approach though. I would suggest that after the initial division you go through and recombine areas with datapoints within a certain threshold of the defined edge.

Bryan
A: 

You could use a genetic algorithm for this. If you define a "cluster" as, say, a rectangular sub-area with a high dot density, you could create an initial set of "solutions", each of which consists of some number of randomly-generated, non-overlapping rectangular clusters. You would then write a "fitness function" which evaluates each solution - in this case, you would want the fitness function to minimize the total number of clusters while maximizing the dot density within each cluster.

Your initial set of "solutions" will all be terrible, most likely, but some will likely be slightly less terrible than the others. You use the fitness function to eliminate the worst solutions, then create the next generation of solutions by cross-breeding the "winners" from the last generation. By repeating this process generation by generation, you should end up with one or more good solutions to this problem.

For a genetic algorithm to work, the different possible solutions to a problem space have to be incrementally different from each other in terms of how well they solve the problem. Dot clusters are perfect for this.

MusiGenesis
+1  A: 

I think it depends on how much seperation there is between the dots and clusters. If the distances are large and irregular, I would initially triangulate the points, and then delete/hide all the triangles with statistically large edge lengths. The remaining sub-triangulations form clusters of arbitrary shape. Traversing the edges of these sub-triangulations yields polygons which can be used to determine which specific points lie in each cluster. The polygons can also be compared to know shapes, such as Kent Fredric's torus, as required.

IMO, grid methods are good for quick and dirty solutions, but get very hungry very quickly on sparse data. Quad trees are better, but TINs are my personal favourite for any more complex analysis.

Shane MacLaughlin
Is "tringulate" a specialized term for a specific type of triangulation, or just a typo?
Paul Tomblin
+5  A: 

There's a great tutorial of clustering algorithms here, they discuss K-means and K-gaussians.

Mike Caron
+13  A: 
Ivan
This sounds interesting, though (correct me if I'm wrong) this approach is not deterministic and therefore is not appropriate in some cases. Nice write-up.
Drew Noakes
Thanks.If deterministic means, that the algorithm **always** gives you the same output for a given input, then it is deterministic.
Ivan
+1  A: 
  1. Fit a probability density function to the data. I would use a "mixture of Gaussians" and fit it using Expectation Maximisation learning primed by the K-means algorithm. The K-means by itself can sometimes be sufficient without EM. The number of clusters itself would need to be primed with a model order selection algorithm.
  2. Then, each point can be scored with p(x) using the model. I.e. get the posterior probability that the point was generated by the model.
  3. Find the maximum p(x) to find the cluster centroids.

This can be coded very quickly in a tool like Matlab using a machine learning toolbox. MoG/EM learning/K-Means clustering are discussed widely on the web/standard texts. My favourite text is "Pattern classification" by Duda/Hart.

graveca
A: 

Cluster 3.0 includes a library of C methods for undertaking statistical clustering. It has a few different methods which may or may not solve your problem depedning on what form your dot clusters take. The library is available here here and is distributed under the Python license.

Ian Turner
A: 

Have you tried simple, off-the-shelf solutions like ImagXpress from Accusoft Pegasus?

The BlobRemoval method can be tuned for pixel count and density to find hole punches, even if they are not continuous. (you can also try a Dilate function to close gaps)

With a little playing around you likely can get the results you need in the vast majority of cases, with very little code or science.

C#:
public void DocumentBlobRemoval( Rectangle Area, int MinPixelCount, int MaxPixelCount, short MinDensity )

A: 

Let me organize this as a research paper

a. Problem Statement

To quote Epaga : "I have a 2D area with "dots" distributed on this area. I now am trying to detect "clusters" of dots, that is, areas with a certain high density of dots."

Note that nowhere is it mentioned that the dots are from an image. (Though they could be ordered as one).

b.Method case 1: If the points are simply dots (dots = points in 2D space). In this scenario, you will already have both x & y locations for all points. The problem reduces to one of clustering the points. Ivan has done a great job of proposing a solution. He also summarized other answers of similar flavor. My 2cts in addition to his post is that you consider whether you know the number of clusters a-priori or not. Algorithms (supervised vs un-supervised clustering can be chosen accordingly).

case 2: If the points indeed originate from an image. Here the problem needs to be clarified. Let me explain using this image alt text If no distinction is made on the gray value of the dots, groups 1, 2, 3, 4 & 5 are all "distinct clusters". However if distinction is made on the basis of gray value, cluster 5 is tricky, as the dot have different gray values.

Regardless, this problem can be reduced to case 1 by raster-scanning the image and storing co-ordinates of non-zero (non white) pixels. Clustering algorithms, as proposed earlier, can then be employed to compute the number of clusters and cluster centers.

nav