views:

455

answers:

4

I have a list of records in my database and each record is associated with a zip code.

What is the "best-practice" for querying all the records in my database to find all entries that are within n miles of another zip code?

Each zip code has a lat/long associated with it in the database so I know I'll have to use that. However, I can't imagine running any sort of distance formula on each pair of zip codes, converting to miles and rejecting those that aren't within my radius.

That seems awfully computationally expensive for such a common query.

I've also considered doing an all-pairs pre-computation but it seems too large to consider also. There are approximately ~40,000 zip codes in the US. So, an all pairs database of each zip code would be (40,000)^2, or 1.6billion entries.

I know this is a common problem on websites so hopefully someone can point me in the right direction for the best way. I'm using SQL Server 2008 and if there are pre-built solutions out there then great, because I really don't want to re-invent the wheel in this instance.


Related Question: Getting all zip codes within radius (this didn't help me)
Also, I know of this SourceForge project but it is derelict and no longer in use.

A: 

This is in fact a very hard problem to solve. I would recommend you do some cheating by pre-creating a database. Create a grid of whatever kind of closeness you need to find, for example, take every 10 miles in each direction, add an entry to the database for each zip for that grid point and the distance, and then when a query comes in, you first translate the query point to one of your grid points. Now you can look up the distance quite easily.

This solution basically means trading space for time, so you can get a quite large database quickly. The good news is: it is very easy data to index.

Ola Bini
An all pairs pre-computation would be kind of big. Aprox. 40,000 us zip codes, so (40,000)^2 for each range would be a lot of database entries.
Simucal
That would be aprox ~1.6billion entries for each range... I don't know if that would be an option.
Simucal
Actually what Ola Bini suggests is that you can shorten the amount of entries greatly if you can limit the maximum distance between the zip codes (10 miles in his example)
tehvan
+7  A: 

I would run a query that returned all records bracketed in the square envelope encompasing the radial search circle (minlat < lat < maxlat and minlong < long < maxlong), and then post-process this to return only the points within the radius circle itself. (Make sure your lat and long fields are indexed).

If you wanted to get fancy, SQL server supports spatial indexes.

mackenir
dang: beat me to it!
Mitch Wheat
+3  A: 

I run a site that needs to run this query about once per second per user, and here's what I've learned:

First off, make sure your location table has indexes on Lat and Lon. That's the difference between 20ms and 15s response times if you have millions of records.

Start off with a bounding-box query to get a set of locations to work with. Then calculate distances on those, sort, and if you're fussy about accuracy, filter a few out.

Frankly, I wouldn't worry about pre-computing anything. Like I say, I run this type of query against a location table with 6,000,000 entries, and it usually returns results in <50ms. Depending on your needs, that really aught to be fast enough.

Good luck!

Jason Kester
Thanks for your personal info on this issue. I appreciate it.
Simucal
A: 

You should look at GeoNames.org. You can query theirwebservice for what you are looking for, or you can dl thier database.

Muad'Dib