tags:

views:

182

answers:

8

I have a set of points which are contained within the rectangle. I'd like to split the rectangles into subrectangles based on point density (giving a number of subrectangles or desired density, whichever is easiest).

The partitioning doesn't have to be exact (almost any approximation better than regular grid would do), but the algorithm has to cope with the large number of points - approx. 200 millions. The desired number of subrectangles however is substantially lower (around 1000).

Does anyone know any algorithm which may help me with this particular task?

+2  A: 

R-tree

Doug Currie
-1: R-tree is not an algorithm, it's a data structure, and you don't propose an algorithm for populating the R-tree.
High Performance Mark
@HPM, perhaps if you look at the page I identified, and search for the heading Algorithm, you would be satisfied.
Doug Currie
+2  A: 

Just to understand the problem. The following is crude and perform badly, but I want to know if the result is what you want>

Assumption> Number of rectangles is even
Assumption> Point distribution is markedly 2D (no big accumulation in one line)

Procedure>
Bisect n/2 times in either axis, looping from one end to the other of each previously determined rectangle counting "passed" points and storing the number of passed points at each iteration. Once counted, bisect the rectangle selecting by the points counted in each loop.

Is that what you want to achieve?

belisarius
Yes, exactly. That would be perfect. I was thinking about initially create a matrix (1000x1000 or so) and store number of points in each cell. Then I can use your bisecting algorithm.
Karol Kolenda
+1  A: 

I think I'd start with the following, which is close to what @belisarius already proposed. If you have any additional requirements, such as preferring 'nearly square' rectangles to 'long and thin' ones you'll need to modify this naive approach. I'll assume, for the sake of simplicity, that the points are approximately randomly distributed.

  1. Split your initial rectangle in 2 with a line parallel to the short side of the rectangle and running exactly through the mid-point.
  2. Count the number of points in both half-rectangles. If they are equal (enough) then go to step 4. Otherwise, go to step 3.
  3. Based on the distribution of points between the half-rectangles, move the line to even things up again. So if, perchance, the first cut split the points 1/3, 2/3, move the line half-way into the heavy half of the rectangle. Go to step 2. (Be careful not to get trapped here, moving the line in ever decreasing steps first in one direction, then the other.)
  4. Now, pass each of the half-rectangles in to a recursive call to this function, at step 1.

I hope that outlines the proposal well enough. It has limitations: it will produce a number of rectangles equal to some power of 2, so adjust it if that's not good enough. I've phrased it recursively, but it's ideal for parallelisation. Each split creates two tasks, each of which splits a rectangle and creates two more tasks.

If you don't like that approach, perhaps you could start with a regular grid with some multiple (10 - 100 perhaps) of the number of rectangles you want. Count the number of points in each of these tiny rectangles. Then start gluing the tiny rectangles together until the less-tiny rectangle contains (approximately) the right number of points. Or, if it satisfies your requirements well enough, you could use this as a discretisation method and integrate it with my first approach, but only place the cutting lines along the boundaries of the tiny rectangles. This would probably be much quicker as you'd only have to count the points in each tiny rectangle once.

I haven't really thought about the running time of either of these; I have a preference for the former approach 'cos I do a fair amount of parallel programming and have oodles of processors.

High Performance Mark
Thinking again and summing up both answers, cutting thin slice rectangles is equivalent to one dimensional problem. So, if you have N points and r rectangles, just take in account the x coordinate of the points and sum up N/r points in each partition
belisarius
A: 

Good question.

I think the area you need to investigate is "computational geometry" and the "k-partitioning" problem. There's a link that might help get you started here

You might find that the problem itself is NP-hard which means a good approximation algorithm is the best you're going to get.

Grembo
The poster has asked for equipartitioning of points, not minimizing or maximizing interactions. The latter can be NP-hard (e.g. graph max cut); the former is `O(n log n)`.
Rex Kerr
A: 

Would K-means clustering or a Voronoi diagram be a good fit for the problem you are trying to solve?

fbrereto
+2  A: 

You're after a standard Kd-tree or binary space partitioning tree, I think. (You can look it up on Wikipedia.)

Since you have very many points, you may wish to only approximately partition the first few levels. In this case, you should take a random sample of your 200M points--maybe 200k of them--and split the full data set at the midpoint of the subsample (along whichever axis is longer). If you actually choose the points at random, the probability that you'll miss a huge cluster of points that need to be subdivided will be approximately zero.

Now you have two problems of about 100M points each. Divide each along the longer axis. Repeat until you stop taking subsamples and split along the whole data set. After ten breadth-first iterations you'll be done.

If you have a different problem--you must provide tick marks along the X and Y axis and fill in a grid along those as best you can, rather than having the irregular decomposition of a Kd-tree--take your subsample of points and find the 0/32, 1/32, ..., 32/32 percentiles along each axis. Draw your grid lines there, then fill the resulting 1024-element grid with your points.

Rex Kerr
Maybe I'm missing the point of kd-trees, but they don't seem to create rectangles with points INSIDE the rectangles as the question states. The second comment says he's looking for equal number of points within a rectangle. The reference contained many other references that I thought might be useful.
Grembo
@Grembo - It's a small implementation detail to switch to making the lines _between_ points instead of _on_ points.
Rex Kerr
KD-Tree was indeed my first reflex too, I balked at first because of the huge set but the sampling approach alleviates the issue.
Matthieu M.
A: 

That's looks like Cluster analysis.

ony
In general you are right - it is a cluster analysis problem. But it is quite unusual problem so I don't think that ready-to-use methods from the cluster analysis toolbox will useful. I'm only interested in rectangles - not any clusters, I have lots of object and I don't care that much about precision.
Karol Kolenda
Use manhattan distance.
ony
A: 

Would a QuadTree work?

A quadtree is a tree data structure in which each internal node has exactly four children. Quadtrees are most often used to partition a two dimensional space by recursively subdividing it into four quadrants or regions. The regions may be square or rectangular, or may have arbitrary shapes. This data structure was named a quadtree by Raphael Finkel and J.L. Bentley in 1974. A similar partitioning is also known as a Q-tree. All forms of Quadtrees share some common features:

  • They decompose space into adaptable cells
  • Each cell (or bucket) has a maximum capacity. When maximum capacity is reached, the bucket splits
  • The tree directory follows the spatial decomposition of the Quadtree
David Basarab