views:

789

answers:

8

Given a set of several million points with x,y coordinates, what is the algorithm of choice for quickly finding the top 1000 nearest points from a location? "Quickly" here means about 100ms on a home computer.

Brute force would mean doing millions of multiplications and then sorting them. While even a simple Python app could do that in less than a minute, it is still too long for an interactive application.

The bounding box for the points will be known, so partitioning the space into a simple grid would be possible. However the points are distributed somewhat unevenly, so I suspect most grid squares would be empty and then suddenly some of them would contain a large portion of the points.

Edit: Does not have to be exact, actually can be quite inaccurate. It wouldn't be a huge deal if the top 1000 are actually just some random points from the top 2000 for example.

Edit: Set of points rarely changes.

+9  A: 

How about using quadtree?

You divide area to rectangles, if area has low density of points, rectangles are large, and if area has high density of points, rectangles will be small. You recursively subdivide each rectangle to four sub rectangles until rectangles are small enough or contain few enough points.

You can then start looking at points in rectangles near the location, and move outwards until you have found your 1000 points.

Code for this could get somewhat complex, so maybe you should try first with the simple grid and see if it is fast enough.

Juha Syrjälä
+7  A: 

Quadtrees are nice, but BSP trees are guaranteed to run in O(log n) time. I think quadtrees require a finite bounding volume, and and there are some degenerate cases where quadtrees fail miserably, such as when a large number of points occupy the same relatively small space.

That being said, Quadtrees are arguably easier to implement and quite effective in most common situations. It's what UPS uses in their routing algorithms, because it's drawbacks don't pose significant problems in practice, probably because cities tend to be spread out over the region of interest.

redmoskito
+1  A: 

I assume the points are in a database or some searchable indexed location? If so it should be pretty quick. From the given point you can have a range on the x and y axis and get all locations within that range (i.e. specify the top left most corner x(a) and y(b) and bottom most right corner x(c) and y(d)).

Then do a query where for points where y >= b AND y <= d AND x >= a AND x <=c. this will be quick assuming you have indexes on the x and y coordinates seperatly. (assuming origin is 0,0 at top left).

You can then increase (or decrease if result is huge) this range by z until the number of points within the result set is >= 1000. Through some trial runs you should be able to come up with a standard deviation and other statistical numbers that will help you determine the size of the rectangle to start with. Your program can also tune its self for this based on the results it gets.

Once you have the rough data set its pretty simple maths to work out the distance between each point and the source point.

RM
They are not in a relational database, and also I remember reading that a relational database like MySQL can only use one index at a time in a situation like this.
Bemmu
This sounds like a great idea. If you've got the indexes set up right, the database software has some nice algorithms up it's sleeve to make these queries really fast. If they aren't in a DB, write a quick script to drop them in one, and at least test it.It's not necessarily the very fastest solution, but it's likely to be the fastest to implement, and your time is worth more than a few CPU cycles, right?
redmoskito
Doing range queries on two different properties can _not_ be efficiently satisfied using only 1D indexes. Relational databases aren't magic.
Nick Johnson
+4  A: 

You want to use a structure like a Quad tree, or an RTree. These are multidimensional index structures.

The key is using a good "space filling curve", which is what helps define the nearness of points. A simple space filling curve is a Zorder, but you would be more interested in something like a hilbert curve.

http://en.wikipedia.org/wiki/Space_filling_curve

I don't know of any prepackaged implementations of this stuff. I recently implemented my own RTree in 2 dimensions that only supports bulk loading and searches (via a provided bounding box).

One drawback here is that your points have to be contained in a finite region. There know there are space filling curves that work for spaces that are not finite, but I do not know anything about them.

Tom
These space-filling curves are an amazingly fresh point of view for me to think about the problem, thanks a lot!
Bemmu
A: 

Try taking a look at morton codes:

http://www.codexon.com/posts/morton-codes/

Unknown
A: 

If the set of points rarely changes, you could also consider using a voronoi diagram. I'm not sure if that helps finding the first point faster, but it should make it a lot easier to find the next 999 points.

Niki
+2  A: 

In addition to the QuadTree and BSP tree suggestions, you should look up nearest neighbour searching. The choice of algorithm is based on how often you are adding to your base dataset. If you are adding and removing often, tree solutions are superior. If the data is more static, nearest neighbour searching and voronoi diagrams can be much faster and scale better.

Shane MacLaughlin
A: 

i know its been said as not being the fastest if you want REALLY REALLY fast results by seeing i found this post from google i thought i'd add my SQL solution that i used a while ago in the form of a stored proc. It looks for locations close by the a coord and returns them by distance.

I hope it helps someone :)

CREATE PROCEDURE [dbo].[getstores] @lat float,  @lng float AS
DECLARE @radius float, @DegToRad float
SET @DegToRad = 57.29577951
SET @radius = 25000
SELECT TOP 10
    name
    ,sto_lat
    ,sto_lng
    ,postcode
    ,ROUND((ACOS((SIN(@lat/57.2958) * SIN(sto_lat/@DegToRad)) +(COS(@lat/@DegToRad) * COS(sto_lat/@DegToRad) *COS(sto_lng/@DegToRad - @lng/@DegToRad))))* 6387.7, 2) AS distance
FROM store
WHERE (sto_lat >= @lat - (@radius/111))
And (sto_lat <= @lat + (@radius/111))
AND (sto_lng >= @lng - (@radius/111))
AND (sto_lng <= @lng + (@radius/111))
AND (
     ISNUMERIC(sto_lat) = 1
    AND
    ISNUMERIC(sto_lat) = 1
)
ORDER BY distance

NOTE: i have already stated that this is not the best solution for this question simply maybe for someone who found this on google like me

Doug