views:

1186

answers:

7

I'm working with a large set of points represented by latitude/longitude pairs (the points are not necessarily unique, there could be several points in the set that are at the same location). The points are stored in a database.

What I need to do is figure out a way to efficiently perform a search to get the number of points that lie within a given radius (say, 25 miles) of an arbitrary point. The count does not need to be 100% accurate - more importantly, it just has to be fast, and reasonably close to the correct count. Doing this with SQL is possible, by using a query with some trigonometry in the WHERE clause to filter points by their distance to a reference point. Unfortunately, this query is very, very expensive and caching is not likely to provide much help because the locations will be very spread out.

I'm ultimately looking to build some kind of in-memory structure that will be able to handle this kind of operation efficiently - trading off some of the accuracy and liveness of the data (maybe rebuilding it only once a day) in return for speed. I've been doing some research on kd-trees, but i'm not clear yet on how well this can be applied to latitude/longitude data (as opposed to x,y data in a 2d plane).

If anybody has any ideas or solutions I should look into, I'd really appreciate it - so thanks in advance.

A: 

Hmm... use a WHERE clause to select a square around your point of origin and use HAVING to calculate the real distance?

Bombe
+2  A: 

Use some kind of search tree for spatial data, e.g. a quad tree. More such data structures are referenced under "See also".

starblue
+1  A: 

Could you maybe supply a sample of your existing expensive query?

If you're doing proper great-circle calculation based on taking the sine() and cosine() of the reference point and the other data points, then a very substantial optimisation could be made by actually storing those sin/cos values in the database in addition to the lat/long values.

Alternatively, just use your database to extract a rectangle of lat/long ranges that match, and only afterwards filter out the ones that are outside the true circular radius.

But do bear in mind that one degree of longitude is a somewhat shorter distance at high latitudes than at the equator. It should be easy to figure out the right aspect ratio for that rectangle, though. You'd also have errors if you need to consider areas very close to the poles, as a rectanglar selection wouldn't cope with a circle that overlapped a pole.

Alnitak
+4  A: 

Use R-Trees.

In Oracle, using Oracle Spatial, you can create an index:

CREATE INDEX ix_spatial ON spatial_table (locations) INDEXTYPE IS MDSYS.SPATIAL_INDEX;

that will create an R-Tree for you and search over it.

You may use any Earth Model you like: WGS84, PZ-90 etc.

Quassnoi
A: 

If it's an option for you, PostGIS is a nice free spatial database.

Paul Tomblin
+1  A: 

This UDF (SQL Server) will get you the distance between two lat/lon points:

CREATE FUNCTION [dbo].[zipDistance] (
    @Lat1 decimal(11, 6),
    @Lon1 decimal(11, 6),
    @Lat2 decimal(11, 6),
    @Lon2 decimal(11, 6)
)
RETURNS
    decimal(11, 6) AS
BEGIN

    IF @Lat1 = @Lat2 AND @Lon1 = @Lon2
     RETURN 0 /* same lat/long points, 0 distance = */

    DECLARE @x decimal(18,13)
    SET @x = 0.0

    /* degrees -> radians */
    SET @Lat1 = @Lat1 * PI() / 180
    SET @Lon1 = @Lon1 * PI() / 180
    SET @Lat2 = @Lat2 * PI() / 180
    SET @Lon2 = @Lon2 * PI() / 180

    /* accurate to +/- 30 feet */
    SET @x = Sin(@Lat1) * Sin(@Lat2) + Cos(@Lat1) * Cos(@Lat2) * Cos(@Lon2 - @Lon1)
    IF 1 = @x
     RETURN 0

    DECLARE @EarthRad decimal(5,1)
    SET @EarthRad = 3963.1

    RETURN @EarthRadius * (-1 * ATAN(@x / SQRT(1 - @x * @x)) + PI() / 2)

END

And, obviously, you can use this in a separate query, such as:

SELECT * FROM table WHERE [dbo].[zipDistance] < 25.0
alphadogg
Just realized you did not want SQL? It should be easy enough to convert this to some other syntax. Counterpoint is that, depending on use, this works decent for me in my applications.
alphadogg
but still a good example of my proposal to store the sines and cosines rather than lat/longs. Doing that you can reduce this function to just one trig function per row instead of five - that one being the last cosine term that's dependent on point1 *and* point2.
Alnitak
Interesting. Somehow I'll have to look into that. It may help me with this one huge database I have that collates data in zip areas...
alphadogg
+4  A: 

What I did was to have 2 additional values on my PostalCode class. Whenever I update the Long/Lat on a PostalCode I calculate an X,Y distance from Long 0, Lat 0.

public static class MathExtender
{
    public static double GetDistanceBetweenPoints(double sourceLatitude, double sourceLongitude, double destLatitude, double destLongitude)
    {
     double theta = sourceLongitude - destLongitude;
     double distance =
      Math.Sin(DegToRad(sourceLatitude))
      * Math.Sin(DegToRad(destLatitude))
      + Math.Cos(DegToRad(sourceLatitude))
      * Math.Cos(DegToRad(destLatitude))
      * Math.Cos(DegToRad(theta));
     distance = Math.Acos(distance);
     distance = RadToDeg(distance);
     distance = distance * 60 * 1.1515;
     return (distance);
    }


    public static double DegToRad(double degrees)
    {
     return (degrees * Math.PI / 180.0);
    }

    public static double RadToDeg(double radians)
    {
     return (radians / Math.PI * 180.0);
    }
}

Then I update my class like so:

private void CalculateGridReference()
{
    GridReferenceX = MathExtender.GetDistanceBetweenPoints(0, 0, 0, Longitude);
    GridReferenceY = MathExtender.GetDistanceBetweenPoints(0, 0, Latitude, 0);
}

So now I have an x,y grid distance (in miles) from grid reference 0,0 for each row in my DB. If I want to find all places with 5 miles of a long/lat I would first get the X,Y grid reference (say 25,75) then I would search 20..30, 70..80 in the DB, and further filter the results in memory using

MathExtensder.GetDistanceBetweenPoints(candidate.Lat, candidate.Long, search.Lat, search.Long) < TheRadiusOfInterest

The in DB part is ultra fast, and the in-memory part works on a smaller set to make it ultra accurate.

Peter Morris