views:

1086

answers:

5

I have a large collection of rectangles, all of the same size. I am generating random points that should not fall in these rectangles, so what I wish to do is test if the generated point lies in one of the rectangles, and if it does, generate a new point.

Using R-trees seem to work, but they are really meant for rectangles and not points. I could use a modified version of a R-tree algorithm which works with points too, but I'd rather not reinvent the wheel, if there is already some better solution. I'm not very familiar with data-structures, so maybe there already exists some structure that works for my problem?

In summary, basically what I'm asking is if anyone knows of a good algorithm, that works in Python, that can be used to check if a point lies in any rectangle in a given set of rectangles.

edit: This is in 2D and the rectangles are not rotated.

+5  A: 

This Reddit thread addresses your problem:

I have a set of rectangles, and need to determine whether a point is contained within any of them. What are some good data structures to do this, with fast lookup being important?

If your universe is integer, or if the level of precision is well known and is not too high, you can use abelsson's suggestion from the thread, using O(1) lookup using coloring:

As usual you can trade space for time.. here is a O(1) lookup with very low constant. init: Create a bitmap large enough to envelop all rectangles with sufficient precision, initialize it to black. Color all pixels containing any rectangle white. O(1) lookup: is the point (x,y) white? If so, a rectangle was hit.

I recommend you go to that post and fully read ModermRonin's answer which is the most accepted one. I pasted it here, but note the original post contains links I did not preserve:

There are many ways to do this. But the best, I think, is using the 2d vector cross product. First, make sure the points of the rectangle are stored in clockwise order. Then do the vector cross product with 1) the vector formed by the two points of the side and 2) a vector from the first point of the side to the test point. Check the sign of the result - positive is inside (to the right of) the side, negative is outside. If it's inside all four sides, it's inside the rectangle. Or equivalently, if it's outside any of the sides, it's outside the rectangle. More explanation here. This method will take 3 subtracts per vector * times 2 vectors per side, plus one cross product per side which is three multiplies and two adds. 11 flops per side, 44 flops per rectangle. If you don't like the cross product, then you could do something like: figure out the inscribed and circumscribed circles for each rectangle, check if the point inside the inscribed one. If so, it's in the rectangle as well. If not, check if it's outside the circumscribed rectangle. If so, it's outside the rectangle as well. If it falls between the two circles, you're fucked and you have to check it the hard way. Finding if a point is inside a circle in 2d takes two subtractions and two squarings (= multiplies), and then you compare distance squared to avoid having to do a square root. That's 4 flops, times two circles is 8 flops - but sometimes you still won't know. Also this assumes that you don't pay any CPU time to compute the circumscribed or inscribed circles, which may or may not be true depending on how much pre-computation you're willing to do on your rectangle set. In any event, it's probably not a great idea to test the point against every rectangle, especially if you have a hundred million of them. Which brings us to the macro problem. How to avoid testing the point against every single rectangle in the set? In 2D, this is probably a quad-tree problem. In 3d, what generic_handle said - an octree. Off the top of my head, I would probably implement it as a B+ tree. It's tempting to use d = 5, so that each node can have up to 4 children, since that maps so nicely onto the quad-tree abstraction. But if the set of rectangles is too big to fit into main memory (not very likely these days), then having nodes the same size as disk blocks is probably the way to go. Watch out for annoying degenerate cases, like some data set that has ten thousand nearly identical rectangles with centers at the same exact point. :P Why is this problem important? It's useful in computer graphics, to check if a ray intersects a polygon. I.e., did that sniper rifle shot you just made hit the person you were shooting at? It's also used in real-time map software, like say GPS units. GPS tells you the coordinates you're at, but the map software has to find where that point is in a huge amount of map data, and do it several times per second.

Again, credit to ModermRonin...

Roee Adler
I have a hard time imagining how ModernRonin's "vector product" method can be faster than simply checking whether min_x <= x <= max_x and min_y <= y <= max_y… I'm not sure about the quadtree approach either, as they mostly help with storing points, not rectangles…
EOL
The vector product permits a rectangle of arbitrary orientation; I’m sure other methods are more appropriate when all rectangles are aligned to the axes of some coordinate system.
A: 

I suggest you take a look at BSP trees (and possible quadtrees or octrees, links available on that page as well). They are used to partition the whole space recursively and allow you to quickly check for a point which rectangles you need to check at all.

At minimum you just have one huge partition and need to check all rectangles, at maximum your partitions get so small, that they get down to the size of single rectangles. Of course the more fine-grained the partition, the longer you need to walk down the tree in order to find the rectangles you want to check.

However, you can freely decide how many rectangles are suitable to be checked for a point and then create the corresponding structure.

Pay attention to overlapping rectangles though. As the BSP tree needs to be precomputed anyways, you may as well remove overlaps during that time, so you can get clear partitions.

Frank
A: 

Your R-tree approach is the best approach I know of (that's the approach I would choose over quadtrees, B+ trees, or BSP trees, as R-trees seem convenient to build in your case). Caveat: I'm no expert, even though I remember a few things from my senior year university class of algorithmic!

EOL
A: 

Why not try this. It seems rather light on both computation and memory.

Consider the projections of all the rectangles onto the base line of your space. Denote that set of line intervals as

 {[Rl1, Rr1], [Rl2, Rr2],..., [Rln, Rrn]}, ordered by increasing left coordinates.

Now suppose your point is (x, y), start a search at the left of this set until you reach a line interval that contains the point x.

If none does, your point (x,y) is outside all rectangles.

If some do, say [Rlk, Rrk], ..., [Rlh, Rrh], (k <= h) then just check whether y is within the vertical extent of any of these rectangles.

Done.

Good luck.

John Doner

John R Doner
+1  A: 

For rectangles that are aligned with the axes, you only need two points (four numbers) to identify the rectangle - conventionally, bottom-left and top-right corners. To establish whether a given point (Xtest, Ytest) overlaps with a rectangle (XBL, YBL, XTR, YTR) by testing both:

  • Xtest >= XBL && Xtest <= XTR
  • Ytest >= YBL && Ytest <= YTR

Clearly, for a large enough set of points to test, this could be fairly time consuming. The question, then, is how to optimize the testing.

Clearly, one optimization is to establish the minimum and maximum X and Y values for the box surrounding all the rectangles (the bounding box): a swift test on this shows whether there is any need to look further.

  • Xtest >= Xmin && Xtest <= Xmax
  • Ytest >= Ymin && Ytest <= Ymax

Depending on how much of the total surface area is covered with rectangles, you might be able to find non-overlapping sub-areas that contain rectangles, and you could then avoid searching those sub-areas that cannot contain a rectangle overlapping the point, again saving comparisons during the search at the cost of pre-computation of suitable data structures. If the set of rectangles is sparse enough, there may be no overlapping, in which case this degenerates into the brute-force search. Equally, if the set of rectangles is so dense that there are no sub-ranges in the bounding box that can be split up without breaking rectangles.

However, you could also arbitrarily break up the bounding area into, say, quarters (half in each direction). You would then use a list of boxes which would include more boxes than in the original set (two or four boxes for each box that overlapped one of the arbitrary boundaries). The advantage of this is that you could then eliminate three of the four quarters from the search, reducing the amount of searching to be done in total - at the expense of auxilliary storage.

So, there are space-time trade-offs, as ever. And pre-computation versus search trade-offs. If you are unlucky, the pre-computation achieves nothing (for example, there are two boxes only, and they don't overlap on either axis). On the other hand, it could achieve considerable search-time benefit.

Jonathan Leffler