I have a 2D area with "dots" distributed on this area. I now am trying to detect "clusters" of dots, that is, areas with a certain high density of dots.
Any thoughts on (or links to articles with thoughts on) how to elegantly detect these areas?
I have a 2D area with "dots" distributed on this area. I now am trying to detect "clusters" of dots, that is, areas with a certain high density of dots.
Any thoughts on (or links to articles with thoughts on) how to elegantly detect these areas?
You could try creating a Quadtree representation of the data. The shorter paths in the graph would correspond to high density areas.
Or, to put it more clearly: given a Quadtree and level-order traversal, each lower-level node composed of "dots" would represent a high density area. As the level of the nodes increases, such nodes represent lower density areas of "dots"
I would calculate the distances from each point to all other points. Then sort the distances. Points that have a distance from each other that is below a threshold are considered Near. A group of points that is near to each other is a cluster.
The problem is that cluster may be clear to a human when he sees the graph, but does not have a clear mathematical definition. You need to define your near threshold, probably bu adjustuing it empirically until the result of your algorithm is (more or less) equal to what you perceive as being clustered.
How about a morphology approach?
Dilate the thresholded image by some number (depending on the target density of dots) then dots in a cluster will appear as a single object.
OpenCV supports morphological operations (as would a range of image processing libraries):
http://www.seas.upenn.edu/~bensapp/opencvdocs/ref/opencvref_cv.htm#cv_imgproc_morphology
"Areas with a certain high density" implies that you know approximately how many dots per unit area you consider high. This leads me towards a grid approach, where you split your total area up into sub-areas of the appropriate size, then count the number of dots in each area. Once you find areas of your grid near your threshhold you can search neighboring areas of the grid too.
Apply a blur filter to a copy of your 2D area. Something like
1 2 3 2 1
2 4 6 4 2
3 6 9 6 3
2 4 6 4 2
1 2 3 2 1
The "darker" areas now identify clusters of dots.
You could overlay a logical grid over your plane. If a grid has a certain number of contained dots, it is considered "dense" and could then be thinned. This is done a lot in GIS applications when working with cluster tolerances. Using the grid helps compartmentalize your thinning algorithm.
This really sounds like an academic question.
The solution that comes to mind involves r* trees. This divides your total area in individually sized and possibly overlapping boxes. After doing this you can then determine if each box represents a 'cluster' or not for you by calculating the mean distance.
If that approach becomes difficult to implement you may be better off splitting your datagrid into equal-sized subdivisions and determining if a cluster occurs in each; you will have to be very mindful of edge conditions with this approach though. I would suggest that after the initial division you go through and recombine areas with datapoints within a certain threshold of the defined edge.
You could use a genetic algorithm for this. If you define a "cluster" as, say, a rectangular sub-area with a high dot density, you could create an initial set of "solutions", each of which consists of some number of randomly-generated, non-overlapping rectangular clusters. You would then write a "fitness function" which evaluates each solution - in this case, you would want the fitness function to minimize the total number of clusters while maximizing the dot density within each cluster.
Your initial set of "solutions" will all be terrible, most likely, but some will likely be slightly less terrible than the others. You use the fitness function to eliminate the worst solutions, then create the next generation of solutions by cross-breeding the "winners" from the last generation. By repeating this process generation by generation, you should end up with one or more good solutions to this problem.
For a genetic algorithm to work, the different possible solutions to a problem space have to be incrementally different from each other in terms of how well they solve the problem. Dot clusters are perfect for this.
I think it depends on how much seperation there is between the dots and clusters. If the distances are large and irregular, I would initially triangulate the points, and then delete/hide all the triangles with statistically large edge lengths. The remaining sub-triangulations form clusters of arbitrary shape. Traversing the edges of these sub-triangulations yields polygons which can be used to determine which specific points lie in each cluster. The polygons can also be compared to know shapes, such as Kent Fredric's torus, as required.
IMO, grid methods are good for quick and dirty solutions, but get very hungry very quickly on sparse data. Quad trees are better, but TINs are my personal favourite for any more complex analysis.
There's a great tutorial of clustering algorithms here, they discuss K-means and K-gaussians.
This can be coded very quickly in a tool like Matlab using a machine learning toolbox. MoG/EM learning/K-Means clustering are discussed widely on the web/standard texts. My favourite text is "Pattern classification" by Duda/Hart.
Cluster 3.0 includes a library of C methods for undertaking statistical clustering. It has a few different methods which may or may not solve your problem depedning on what form your dot clusters take. The library is available here here and is distributed under the Python license.
Have you tried simple, off-the-shelf solutions like ImagXpress from Accusoft Pegasus?
The BlobRemoval method can be tuned for pixel count and density to find hole punches, even if they are not continuous. (you can also try a Dilate function to close gaps)
With a little playing around you likely can get the results you need in the vast majority of cases, with very little code or science.
C#:
public void DocumentBlobRemoval(
Rectangle Area,
int MinPixelCount,
int MaxPixelCount,
short MinDensity
)
Let me organize this as a research paper
a. Problem Statement
To quote Epaga : "I have a 2D area with "dots" distributed on this area. I now am trying to detect "clusters" of dots, that is, areas with a certain high density of dots."
Note that nowhere is it mentioned that the dots are from an image. (Though they could be ordered as one).
b.Method case 1: If the points are simply dots (dots = points in 2D space). In this scenario, you will already have both x & y locations for all points. The problem reduces to one of clustering the points. Ivan has done a great job of proposing a solution. He also summarized other answers of similar flavor. My 2cts in addition to his post is that you consider whether you know the number of clusters a-priori or not. Algorithms (supervised vs un-supervised clustering can be chosen accordingly).
case 2: If the points indeed originate from an image. Here the problem needs to be clarified. Let me explain using this image If no distinction is made on the gray value of the dots, groups 1, 2, 3, 4 & 5 are all "distinct clusters". However if distinction is made on the basis of gray value, cluster 5 is tricky, as the dot have different gray values.
Regardless, this problem can be reduced to case 1 by raster-scanning the image and storing co-ordinates of non-zero (non white) pixels. Clustering algorithms, as proposed earlier, can then be employed to compute the number of clusters and cluster centers.