views:

561

answers:

3

I have a database table of all zipcodes in the US that includes city,state,latitude & longitude for each zipcode. I also have a database table of points that each have a latitude & longitude associated with them. I'd like to be able to use 1 MySQL query to provide me with a list of all unique city/state combinations from the zipcodes table with the total number of points within a given radius of that city/state. I can get the unique city/state list using the following query:

select city,state,latitude,longitude
from zipcodes 
group by city,state order by state,city;

I can get the number of points within a 100 mile radius of a specific city with latitude '$lat' and longitude '$lon' using the following query:

select count(*) 
from points 
where (3959 * acos(cos(radians($lat)) * cos(radians(latitude)) * cos(radians(longitude) - radians($lon)) + sin(radians($lat)) * sin(radians(latitude)))) < 100;

What I haven't been able to do is figure out how to combine these queries in a way that doesn't kill my database. Here is one of my sad attempts:

select city,state,latitude,longitude,
    (select count(*) from points
     where status="A" AND 
          (3959 * acos(cos(radians(zipcodes.latitude)) * cos(radians(latitude)) * cos(radians(longitude) - radians(zipcodes.longitude)) + sin(radians(zipcodes.latitude)) * sin(radians(latitude)))) < 100) as 'points' 
from zipcodes 
group by city,state order by state,city;

The tables currently have the following indexes:

Zipcodes - `zip` (zip)
Zipcodes - `location` (state,city)
Points - `status_length_location` (status,length,longitude,latitude)

When I run explain before the previous MySQL query here is the output:

+----+--------------------+----------+------+------------------------+------------------------+---------+-------+-------+---------------------------------+
| id | select_type        | table    | type | possible_keys          | key                    | key_len | ref   | rows  | Extra                           |
+----+--------------------+----------+------+------------------------+------------------------+---------+-------+-------+---------------------------------+
|  1 | PRIMARY            | zipcodes | ALL  | NULL                   | NULL                   | NULL    | NULL  | 43187 | Using temporary; Using filesort | 
|  2 | DEPENDENT SUBQUERY | points   | ref  | status_length_location | status_length_location | 2       | const | 16473 | Using where; Using index        | 
+----+--------------------+----------+------+------------------------+------------------------+---------+-------+-------+---------------------------------+

I know I could loop through all the zipcodes and calculate the number of matching points within a given radius but the points table will be growing all the time and I'd rather not have stale point totals in the zipcodes database. I'm hoping a MySQL guru out there can show me the error of my ways. Thanks in advance for your help!

+1  A: 

MySQL Guru or not, the problem is that unless you find a way of filtering out various rows, the distance needs to be calculated between each point and each city...

There are two general approaches that may help the situation

  • make the distance formula simpler
  • filter out unlikely candidates to the 100k radius from a given city

Before going into these two avenue of improvement, you should decide on the level of precision desired with regard to this 100 miles distance, also you should indicate which geographic area is covered by the database (is this just continental USA etc.

The reason for this is that while more precise numerically, the Great Circle formula, is very computationally expensive. Another avenue of performance improvement would be to store "Grid coordinates" of sorts in addtion (or instead of) the Lat/Long coordinates.

Edit:
A few ideas about a simpler (but less precise) formula:
Since we're dealing with relatively small distances, (and I'm guessing between 30 and 48 deg Lat North), we can use the euclidean distance (or better yet the square of the euclidean distance) rather than the more complicated spherical trigonometry formulas.
depending on the level of precision expected, it may even be acceptable to have one single parameter for the linear distance for a full degree of longitude, taking something average over the area considered (say circa 46 statute miles). The formula would then become

  LatDegInMi = 69.0
  LongDegInMi = 46.0
  DistSquared = ((Lat1 - Lat2) * LatDegInMi) ^2 + ((Long1 - Long2) * LongDegInMi) ^2

On the idea of a columns with grid info to filter to limit the number of rows considered for distance calculation.
Each "point" in the system, be it a city, or another point (?delivery locations, store locations... whatever) is assigned two integer coordinate which define the square of say 25 miles * 25 miles where the point lies. The coordinates of any point within 100 miles from the reference point (a given city), will be at most +/- 4 in the x direction and +/- 4 in the y direction. We can then write a query similar to the following

SELECT city, state, latitude, longitude, COUNT(*)
FROM zipcodes Z
JOIN points P 
  ON P.GridX IN (
    SELECT GridX - 4, GridX - 3, GridX - 2, GridX - 1, GridX, GridX +1, GridX + 2 GridX + 3, GridX +4
   FROM zipcode ZX WHERE Z.id = ZX.id)
  AND
   P.GridY IN (
    SELECT GridY - 4, GridY - 3, GridY - 2, GridY - 1, GridY, GridY +1, GridY + 2 GridY + 3, GridY +4
   FROM zipcode ZY WHERE Z.id = ZY.id)
WHERE P.Status = A
   AND ((Z.latitude - P.latitude) * LatDegInMi) ^2 
      + ((Z.longitude - P.longitude) * LongDegInMi) ^2 < (100^2)
GROUP BY city,state,latitude,longitude;

Note that the LongDegInMi could either be hardcoded (same for all locations within continental USA), or come from corresponding record in the zipcodes table. Similarly, LatDegInMi could be hardcoded (little need to make it vary, as unlike the other it is relatively constant).

The reason why this is faster is that for most records in the cartesian product between the zipcodes table and the points table, we do not calculate the distance at all. We eliminate them on the basis of a index value (the GridX and GridY).

This brings us to the question of which SQL indexes to produce. For sure, we may want: - GridX + GridY + Status (on the points table) - GridY + GridX + status (possibly) - City + State + latitude + longitude + GridX + GridY on the zipcodes table

An alternative to the grids is to "bound" the limits of latitude and longitude which we'll consider, based on the the latitude and longitude of the a given city. i.e. the JOIN condition becomes a range rather than an IN :

JOIN points P 
  ON    P.latitude > (Z.Latitude - (100 / LatDegInMi)) 
    AND P.latitude < (Z.Latitude + (100 / LatDegInMi)) 
    AND P.longitude > (Z.longitude - (100 / LongDegInMi)) 
    AND P.longitude < (Z.longitude + (100 / LongDegInMi))
mjv
@mjv Any more actionable suggestions such as how to make the distance formula simpler or the best approach to filtering out the unlikely candidates? Thanks for your help!
Russell C.
@mjv Can you provide more details about 'grid coordinates'. What do they look like and how do they help improve performance?
Russell C.
@Russel: see edits about simpler [squared] distance formula, and also about using a Grid system which would allow SQL to use an index for pre-filtering the points to consider for precise distance calculation.
mjv
@mjv Thanks - I think that is getting a lot closer. I've updated the query to: select z.city,z.state,z.latitude,z.longitude,count(p.id) as 'points' from zipcodes z join points p on p.latitude > (z.latitude-(100/69.0)) AND p.latitude < (z.latitude+(100/69.0)) AND p.longitude > (z.longitude-(100/46.0)) AND p.longitude < (z.longitude+(100/46.0)) AND p.status="A" group by z.city,z.state order by z.state,z.city; However, the performance isn't much better than my original combined query. Any ideas on what indexing I might be able to do in order to improve it?
Russell C.
@Russell, For sure, a status+latitude+longitude index on points would greatly help. Another consideration is to pre-compute the 100 miles range (100/69.0 and such), altough I doubt this would have a significant impact.
mjv
@mjv Thanks for all the help. Unfortunately it seems that even with all the optimization the query is still killing the database. I think it's going to require calculating the counts on a regular basis or as you suggested using some kind of simplification such as grids. Thanks again!
Russell C.
@Russel, sorry it wasn't more conclusive. For future reference, do you have an indication about the improvement all these changes allowed (just an order of magnitude: were were say cutting the time in half, but it not being sufficient however or was it just say 15% ... ? Also do you have a rough description of the database's statistical profile (is it say 4 M rows (points) total, with on average 15,000 within 100 mi of a citi or is it...) ?
mjv
A: 
SELECT * FROM tblLocation 
    WHERE 2 > POWER(POWER(Latitude - 40, 2) + POWER(Longitude - -90, 2), .5)

where the 2 > part would be the number of parallels away and 40 and -90 are lat/lon of the test point

Sorry I didn't use your tablenames or strutures, i just coppied this out of one of my stored procedures I have in one of my databases.

If I wanted to see the number of points in a zip code I suppose I would do something like this:

SELECT ParcelZip, COUNT(LocationID) AS LocCount FROM tblLocation 
    WHERE 2 > POWER(POWER(Latitude - 40, 2) + POWER(Longitude - -90, 2), .5)
    GROUP BY ParcelZip

Getting the total count of all locations in the range would look like this:

SELECT COUNT(LocationID) AS LocCount FROM tblLocation 
    WHERE 2 > POWER(POWER(Latitude - 40, 2) + POWER(Longitude - -90, 2), .5)

A cross join may be inefficient here since we are talking about a large quantity of records but this should do the job in a single querry:

SELECT ZipCodes.ZipCode, COUNT(PointID) AS LocCount FROM Points
    CROSS JOIN ZipCodes
    WHERE 2 > POWER(POWER(Points.Latitude - ZipCodes.Latitude, 2) + POWER(Points.Longitude - ZipCodes.Longitude, 2), .5)
    GROUP BY ZipCodeTable.ZipCode
Jrud
I'm assuming this would be a replacement for the subquery?
Russell C.
Oh yes, I hadn't even noticed the second part of the problem. Sorry! I added another part on to count the locations in each zip code
Jrud
I think I'm missing something Jrud since I'm not seeing how your second query would return the number of points that is within 2 paralells of the 40 and -90 lat/lon location. It seems that it would only return a count of cities near a specific city from the same table - not 2 separate tables.
Russell C.
I may not be getting exactly how your points table and zip codes tables are linked. Do you know what the lat/lon coords of each zip code are? or do you know what zip codes the points are in?
Jrud
I know the lat/lon of each zipcode in the zipcodes table and the lat/lon of each point in the points table. From that I need the total number of points within a given radius for each specific zipcode - all in 1 query.
Russell C.
alright, added another one... is that any closer?
Jrud
It is but unfortunately that query has the exact same performance issues as the combined query I posted earlier. Thanks for trying though.
Russell C.
A: 

When I do these type of searches, my needs allow some approximation. So I use the formula you have in your second query to first calculate the "bounds" -- the four lat/long values at the extremes of the allowed radius, then take those bounds and do a simple query to find the matches within them (less than the max lat, long, more than the minimum lat, long). So what I end up with is everything within a square sitting inside the circle defined by the radius.

Devin Ceartas
Thanks for the suggestion Devin. With that approach I'm assuming I'd need to do a DB query for each city/state to figure out the bounds which would amount to 10,000s of queries. Can you think of anyway to combine that into a single query?
Russell C.