I have a rectangular planar grid with each cell assigned an integer weight. I am looking for an algorithm to identify clusters of 3 to 6 adjacent cells with higher-than-average weight. These blobs should have an approximately circular shape.

In my case the average weight of cells not containing a cluster is around 6, and that of cells containing a cluster is around 6+4, i.e. there is a "background weight" somewhere around 6. The weights follow Poisson statistics.

For a small background, greedy or seeded algorithms perform pretty well, but this breaks down if my cluster cells have weights close to the fluctuations in the background, i.e. they will tend to find a cluster even though there is nothing. Also, I cannot do a brute-force search looping through all possible configurations, because my grid is large (something like 1000x1000) and I plan to do this very often (10^9 times). I have the impression there might be ways to tackle this with graph theory. I have heard of vertex covers and cliques, but am not sure how best to translate my problem into their language. I know that graph theory might have issues with the statistical nature of the input, but I would be interested to see what algorithms from there could find, even if they cannot identify every cluster.

Here is an example clipping: the framed region has on average 10 entries per cell; all other cells have on average 6. Of course the grid extends further.

| 8|  8|  2|  8|  2|  3| 
| 6|  4|  3|  6|  4|  4| 
        =============
| 8|  3||13|  7| 11|| 7|
|10|  4||10| 12|  3|| 2|
| 5|  6||11|  6|  8||12|
        =============
| 9|  4|  0|  2|  8|  7|
A: 

For graph-theory solutions there are a couple of sentences on Wikipedia, but you are probably best off posting on MathOverflow. This question might also be useful.

The traditional (and, given its ubiquity, probably best) computing method for solving these is raster analysis - well known in the world of GIS and Remote Sensing, so there are a number of tools that provide implementations. Keywords to use to find the one best suited to your problem are raster, nearest neighbor, resampling, and clustering. The GDAL library is often the basis for other tools.

E.g. http://clusterville.org/spatialtools/index.html

You could try checking out the GDAL library and source code to see if you can use it in your situation, or to see how it is implemented.

For checking for circular shapes you could convert the remaining values to polygons and check the resultant features:

http://www.gdal.org/gdal_polygonize.html

geographika
Thanks for the links, geographika. Since my signal is very weak I wouldn't like to discard the background, though (this opens up another set of problems with fluctuations anyway). As for shape, I don't really care about it; I just want to find blobs. In the end these ideas would still (need to) go along the lines of greedy or seeded algorithms, which are known to have issues here. That's why I am looking for different approaches.
honk
@geographika: Thanks again for the pointers. But as I wrote, I am already aware of these tools and am looking for a totally different direction to attack this (mostly so as not to get too used to the limitations of thinking in their terms, I guess). Maybe the question is just poorly phrased or lacks focus.
honk
A: 

I'm not sure I see a graph theory analogy, but you can speed things up by pre-computing an area integral. This feels like a multi-scale thing.

A[i, j] = Sum[p[u, v], {u, 0, i}, {v, 0, j}];

Then the average brightness of the rectangular region with corners (a,b) and (c,d) is

(A[c,d] - (A[c,b] + A[a,d]) + A[a,b])/((c-a)(d-b))

Overflow is probably not your friend if you have big numbers in your cells.
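A minimal sketch of this in Python/NumPy (function names are my own; the indexing follows the formula above, so the queried region excludes row a and column b):

```python
import numpy as np

def summed_area_table(p):
    """Cumulative 2-D sum: A[i, j] = sum of p[0..i, 0..j]."""
    return p.cumsum(axis=0).cumsum(axis=1)

def region_mean(A, a, b, c, d):
    """Mean of p over rows a+1..c and columns b+1..d, using four
    table lookups instead of scanning the whole rectangle."""
    total = A[c, d] - (A[c, b] + A[a, d]) + A[a, b]
    return total / ((c - a) * (d - b))
```

Building the table is O(N) once, after which every rectangle query is O(1), which matters if you plan to probe many candidate windows per grid.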

Justin
Could you explain how this could be useful for identifying the blob? The problem is really that the "signal" is very close to the "background".
honk
Instead of computing the average value in a k-neighborhood of a pixel (which requires k*k memory accesses and additions), you can use 4 memory accesses and 4 additions.
Justin
This type of thing might work well if you started with larger regions of the image and tried to recursively find areas of similar signal/noise ratio, then applied your area-averaging filter.
Justin
A: 

Use the union-find algorithm for clustering? It's very fast.

I guess the graph would result from considering each pair of neighboring high-valued cells connected. Use the union-find algorithm to find all clusters, and accept all those above a certain size, perhaps with shape constraints too (e.g. based on average squared distance from the cluster center vs. cluster size). It's a trivial variation on the union-find algorithm to collect the statistics you would need for this as you go (count, sum of x, sum of x^2, sum of y, sum of y^2).
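A rough sketch of this idea in Python (the 3-to-6 size window comes from the question; the threshold value, 4-connectivity, and function names are my own assumptions):

```python
import numpy as np

def find(parent, x):
    # Path-compressing find on a dict-based union-find structure.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster_high_cells(grid, threshold, min_size=3, max_size=6):
    """Union neighboring cells whose weight exceeds `threshold`,
    then return the clusters whose size falls in [min_size, max_size]."""
    rows, cols = grid.shape
    high = grid > threshold
    parent = {(r, c): (r, c)
              for r in range(rows) for c in range(cols) if high[r, c]}
    for r in range(rows):
        for c in range(cols):
            if not high[r, c]:
                continue
            for nbr in ((r - 1, c), (r, c - 1)):  # up/left neighbors suffice
                if nbr in parent:
                    ra, rb = find(parent, (r, c)), find(parent, nbr)
                    if ra != rb:
                        parent[ra] = rb
    clusters = {}
    for cell in parent:
        clusters.setdefault(find(parent, cell), []).append(cell)
    return [cells for cells in clusters.values()
            if min_size <= len(cells) <= max_size]
```

The per-cluster statistics mentioned above (count, sums of x, x^2, y, y^2) could be accumulated at the roots during the same pass instead of collecting cell lists afterwards.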

Paul Harrison
A: 

If you are just looking for a way to translate your problem into a graph problem, here's what you could do.

From each point, look at all your neighbors (this could be the 8 adjacent squares or the 4 adjacent ones, depending on what you want). Find the neighbor with the maximum value; if it is larger than you, connect yourself to that square. If it is smaller, do nothing.

After this you will have a forest, or possibly a single tree (though I imagine that is unlikely). This is one possible translation of your matrix to a graph. I'm not sure if it is the most useful translation.
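The steps above could be sketched like this in Python (8-neighborhood, ties broken arbitrarily; the function name is mine):

```python
import numpy as np

def build_forest(grid):
    """Point each cell at its largest 8-neighbor, but only if that
    neighbor is strictly larger; local maxima become roots.
    Returns a dict mapping each cell to its parent cell."""
    rows, cols = grid.shape
    parent = {}
    for r in range(rows):
        for c in range(cols):
            best = (r, c)
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                        if grid[nr, nc] > grid[best]:
                            best = (nr, nc)
            # Connect only if strictly larger; otherwise this cell is a root.
            parent[(r, c)] = best if grid[best] > grid[r, c] else (r, c)
    return parent
```

Each tree in the resulting forest hangs off a local maximum, so the roots are natural cluster-seed candidates, which is where the fluctuation problem discussed in the comments below comes in.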

luke
That sounds like a very interesting approach. Would I create an extra tree to account for fluctuations?
honk
I'm not sure what you mean?
luke
@luke: What you described in your answer connects each cell to the neighbor with the highest count. This count can fluctuate, so the cells combined this way might not belong to the same cluster.
honk