views:

1254

answers:

10

I'm wondering if there is an algorithm for calculating the nearest locations (represented by lat/long) in better than O(n) time.

I know I could use the Haversine formula to get the distance from the reference point to each location and sort ASC, but this is inefficient for large data sets.

How does the MySQL DISTANCE() function perform? I'm guessing O(n)?

+1  A: 

Better than O(n)? Only if you go the way of radix sort or store the locations with hash keys that represent the general location they are in.

For instance, you could divide the globe with latitude and longitude to the minutes, enumerate the resulting areas, and make the hash for a location it's area. So when the time comes to get the closest location, you only need to check at most 9 hash keys -- you can test beforehand if an adjacent grid can possibly provide a close location than the best found so far, thus decreasing the set of locations to compute the distance to. It's still O(n), but with a much smaller constant factor. Properly implemented you won't even notice it.

Or, if the data is in memory or otherwise randomly accessible, you could store it sorted by both latitude and longitude. You then use binary search to find the closest latitude and longitude in the respective data sets. Next, you keep reading locations with increasing latitude or longitude (ie, the preceding and succeeding locations), until it becomes impossible to find a closer location.

You know you can't find a close location when the latitude of the next location to either side of the latitude-sorted data wouldn't be closer than the best case found so far even if they belonged in the same longitude as the point from which distance is being calculated. A similar test applies for the longitude-sorted data.

This actually gives you better than O(n) -- closer to O(logN), I think, but does require random, instead of sequential, access to data, and duplication of all data (or the keys to the data, at least).

Daniel
Anything better than O(n) will require random access, I wouldn't lose sleep over it.
Mark Ransom
Or you could avoid inventing your own algorithms, and use the built in spatial indexing support, or at worst case, an R-Tree.
Nick Johnson
Hey, if you don't like my answer, don't vote for it. This here is a bubble sort of good answers. Having said that, R-Tree doesn't seem a good match for the problem. As for spatial indexing support, I don't know its properties, so I can't vouch for it.
Daniel
An R-Tree _is_ spatial indexing - and it's already implemented in MySQL in the form of their spatial extensions. Why do you think R-Tree is a bad match?
Nick Johnson
There are other types of spatial indexing. I don't know the properties of whatever MySQL is choosing. Anyway, R-Tree has a good average but a lousy worst case for the exact problem mentioned. Supposedly, P-R-Tree doesn't have this problem -- actually, I have no reason o doubt that. But, again, I don't know WHAT MySQL is using. I personally like kd-tree much better, and I voted for that.
Daniel
A: 

I guess you could do it theoretically if you had a large enough table to do this... secondly, perhaps caching correctly could get you very good average case?

CookieOfFortune
+1  A: 

If you are looking for the (1) closest location, there's no need to sort. Simply iterate through your list, calculating the distance to each point and keeping track of the closest one. By the time you get through the list, you'll have your answer.

Even better would be to introduce the concept of grids. You would assign each point to a grid. Then, for your search, first determine the grid you are in and perform your calculations on the points in the grid. You'll need to be a little careful though. If the test location is close to the boundary of a grid, you'll need to search those grid(s) as well. Still, this is likely to be highly performant.

G Mastros
The algorithm you gave is O(n), not O(1). He wants better.
Daniel
However, the "grid" idea is close to the hashtable mentioned in another answer.
Michael Todd
I misread his answer, actually. Still, the first one is O(n).
Daniel
+1  A: 

You mention MySql, but there are some pretty sophisticated spatial features in SQL Server 2008 including a geography data type. There's some information out there about doing the types of things you are asking about. I don't know spatial well enough to talk about perf. but I doubt there is a bounded time algorithm to do what you're asking, but you might be able to do some fast set operations on locations.

JP Alioto
Pretty much any major RDBMS has spatial features these days, including MySQL and PostgreSQL.
Nick Johnson
+2  A: 

If the data set being searched is static, e.g., the coordinates of all gas stations in the US, then a proper index (BSP) would allow for efficient searching. Postgres has had good support since the mid 90's for 2-dimensional indexed data so you can do just this sort of query.

caskey
+6  A: 

If you use a kd-tree to store your points, you can do this in O(log n) time (expected) or O(sqrt(n)) worst case.

rampion
How might I implement something like this using MySQL? Any good resources I might find for indexing and querying lat/long points using a kd-tree?
1nsane
A: 

An R-Tree index can be used to speed spatial searches like this. Once created, it allows such searches to be better than O(n).

Jonathan Leffler
+1  A: 

I haven't looked at it myself, but Postgres does have a module dedicated to the management of GIS data.

In an appliation I worked on in a previous life we took all of the data, computed it's key for a quad-tree (for 2D space) or an oct-tree (for 3D space) and stored that in the database. It was then a simple matter of loading the values from the database (to prevent you having to recompute the quad-tree) and following the standard quad-tree search algorithm.

This does of course mean you will touch all of the data at least once to get it into the data structure. But persisting this data-structure means you can get better lookup speeds from then on. I would imagine you will do a lot of nearest-neighbour checks for each data-set.

(for kd-tree's wikipedia has a good explanation: http://en.wikipedia.org/wiki/Kd-tree)

Aidos
+1  A: 

You need a spatial index. Fortunately, MySQL provides just such an index, in its Spatial Extensions. They use an R-Tree index internally - though it shouldn't really matter what they use. The manual page referenced above has lots of details.

Nick Johnson
+1  A: 

I wrote a article about Finding the nearest Line at DDJ a couple of years ago, using a grid (i call it quadrants). Using it to find the nearest point (instead of lines) would be just a reduction of it.

Using Quadrants reduces the time drastically, although the complexity is not determinable mathematically (all points could theoretically lie in a single quadrant). A precondition of using quadrants/grids is, that you have a maximum distance for the point searched. If you just look for the nearest point, without giving a maximum distance, you cant use quadrants.

In this case, have a look at A Template for the Nearest Neighbor Problem (Larry Andrews at DDJ), having a retrival complexity of O(log n). I did not compare the runtime of both algorithms. Probably, if you have a reasonable maximum width, quadrants are better. The better general purpose algorithm is the one from Larry Andrews.

RED SOFT ADAIR