views:

189

answers:

2

I'm working on a school project that involves taking a lat/long point and finding the top five closest points in a known list of places. The list is to be stored in memory, with the caveat that we must choose an "appropriate data structure" -- that is, we cannot simply store all the places in an array and compare distances one-by-one in a linear fashion. The teacher suggested grouping the place data by US State to prevent calculating the distance for places that are obviously too far away. I think I can do better.

From my research online it seems like an R-Tree or one of its variants might be a neat solution. Unfortunately, that sentence is as far as I've gotten with understanding the actual technique, as the literature is simply too dense for my non-academic head.

  • Can somebody give me a really high overview of what the process is for populating an R-Tree with lat/long data, and then traversing the tree to find those 5 nearest neighbors of a given point?

  • Additionally the project is in C, and I don't have to reinvent the wheel on this, so if you've used an existing open source C implementation of an R Tree I'd be interested in your experiences.

UPDATE: This blog post describes a straightforward search algorithm for a regionally partitioned space (like a PR quadtree). Hope that helps a future reader.

+5  A: 

Have you considered alternative data structures? I believe, instead of R-tree a Point Quadtree would be more effective for your need.Spatial Index Demos provides some demos for a list of possible data structures including R-tree and Point Quadtree. Hope it gives an insight.

tafa
+1 - if you only need to store points then a quad tree will do the job and is fairly simple to implement. R-Trees allow overlapping bounding boxes for arbitrary shapes and the OP doesn't appear to need that.
ConcernedOfTunbridgeWells
Spatial index demos really helped me grok this stuff, thank you!
roufamatic
+3  A: 

Quad Trees

A quad tree takes a square of space and divides it into four children with half the dimensions along the X and Y axis.

+---+---+
|   |   |  Each square is a child
|   |   |  of the parent; when you
+---+---+  get to leaves a node has
|   |   |  a single point or a list
|   |   |  of points.
+---+---+

This data structure is recursive and you search for points by checking which child holds the point until you get to the leaf. A leaf either has a single member (point with X,Y coords) or a list of members, depending on the implementation. If you fill up a node you split it into 4 and distribute the children. Essentially, the data structure is a generalisation of a binary tree, so it is not necessarily balanced.

Balancing a quad tree may not be necessary for your purposes and is left as an exercise for the reader - try searching on the web for 'balanced quad tree'

Note that this data structure cannot index items that can overlap, but if you're only storing points this won't be a problem.

Finding nearest neighbours in a quad tree

Off the top of my head, here's a quick and dirty algorithm for finding the 'n' nearest neighbours to your point. It's not necessarily optimially efficient, but it will be fairly simple to implement. If someone has a link to a better one, feel free to post it in a comment or answer.

  • Locate the quad tree node containing your point, keeping a list of its parents.

  • Push all of the points in the node into a priority queue based on their distance from your base point (i.e. by the length of the hypotenuse per Pythagoras' theorem). Depending on the implementation there may be one or more per node. For a simple implementation of a priority queue data structure, look up 'binary heap'.

  • If any of the 'n' points are further away then the edges of the bounding box, add the contents of its neighbours. i.e. If your base point is close to the edge of the bounding box, it is possible that neighbouring tree nodes might contain points that are closer than the points found within your bounding box. You will need to back up the tree to do this, which is why you need to keep track of your parent nodes.

  • When all of the 'n' closest points are closer than the edges of your bounding box you know that there could not possibly be neighbours that you have missed. Therefore, the 'n' closest points within this box must be your 'n' closest neighbours.

ConcernedOfTunbridgeWells