views:

116

answers:

3

How can I optimize the following SQL so that it runs faster and puts less load on my server?

I need to do a radius search around a US ZIP Code, e.g. find everything within 50 miles of a particular ZIP Code (using latitude and longitude to calculate the distance), and then count how many other rows (e.g. other ZIP Codes) from my database fall inside that radius.

Once I have the result (for example, 350 rows of different ZIP Codes within 50 miles of the particular ZIP Code), I need to pass them into another query that counts the total rows and displays it as a single, easy-to-read result. Here is an example of my query:

SELECT COUNT(*)
FROM ( SELECT b.ID,
              ROUND((ACOS(SIN(3.142/180*32.91336) * SIN(3.142/180*z.latitude)
                   + COS(3.142/180*32.91336) * COS(3.142/180*z.latitude)
                   * COS((3.142/180*z.longitude) - (3.142/180*-85.93836))) * 3959), 2) AS distance
       FROM zipcode2business.accountants b
       LEFT JOIN zipcodeworld.storelocator_us z ON b.ZIPCODE = z.ZIP_CODE
       WHERE z.latitude != 32.91336
         AND z.longitude != -85.93836
         AND b.STATE = 'AL'
       HAVING distance BETWEEN 0 AND 50
     ) AS total;

Hopefully I haven't done anything wrong; it displays the correct result (350 rows), but I need an optimized way to run it, because this SQL causes high CPU usage. When I run EXPLAIN on this query, it displays the following:

+----+-------------+-------+--------+------------------+---------+---------+----------------------------+------+------------------------------+
| id | select_type | table | type   | possible_keys    | key     | key_len | ref                        | rows | Extra                        |
+----+-------------+-------+--------+------------------+---------+---------+----------------------------+------+------------------------------+
|  1 | PRIMARY     | NULL  | NULL   | NULL             | NULL    | NULL    | NULL                       | NULL | Select tables optimized away |
|  2 | DERIVED     | b     | ref    | ZIPCODE,STATE    | STATE   | 4       |                            | 3900 | Using where                  |
|  2 | DERIVED     | z     | eq_ref | PRIMARY,LAT_LONG | PRIMARY | 9       | zipcode2business.b.ZIPCODE |    1 | Using where                  |
+----+-------------+-------+--------+------------------+---------+---------+----------------------------+------+------------------------------+
3 rows in set (0.20 sec)

Now, from the output above, is "Select tables optimized away" in the Extra column a good thing? Please show me the best way to optimize this query.

A: 

Do you need to do all those calculations on the SQL server? I generally try to use SQL only for basic CRUD on the data, and do all other computations outside of SQL. You may want to try retrieving just the data your calculation is based on, and then doing the actual calculation in whatever application is retrieving the data.

Corey Sunwold
+1  A: 

The SQL itself seems fine; the bulk of the CPU time must be spent doing math. There are two avenues for optimization:

  • simplify the formula
  • filter out rows early ("prune") on the basis of an even simpler calculation

I don't have time at the moment for full details, but here is the general idea: approximate the distance between the reference ZIP Code location and the other locations with a cheap (CPU-wise) calculation, and only do the full math (with a better formula than the one in the original query) for the locations that come out below 50 miles (plus a small margin, to account for possible underestimation).

Estimating the distance and pruning
We calculate, once, the distance expressed in miles that corresponds to one degree of latitude and to one degree of longitude at the reference ZIP Code location; call these MpDLat and MpDLong. Alternatively, we calculate the fractional number of degrees that corresponds to our target radius from the reference location; call these Dp50Lat and Dp50Long. We then work with the absolute value of the difference between the latitudes, and between the longitudes, relative to the reference location, and filter out the locations for which this distance in either direction (lat or long) exceeds our limit, i.e. something like the following:

WHERE .... (some other conditions....)
   AND (abs(z.latitude - 32.91336) * MpDLat) < 50
   AND (abs(z.longitude + 85.93836) * MpDLong) < 50
-- or, if we go by the Dp50 values
WHERE .... (some other conditions....)
   AND abs(z.latitude - 32.91336)  < Dp50Lat
   AND abs(z.longitude + 85.93836) < Dp50Long
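For the reference latitude used in the question (32.91336), these constants can be precomputed once. A sketch (the 69.0 and 69.17 miles-per-degree figures are common approximations, not values from the original post):

```sql
-- One-off computation of the pruning constants.
-- Miles per degree of latitude is roughly constant; miles per degree of
-- longitude shrinks with the cosine of the latitude.
SELECT 69.0                                   AS MpDLat,   -- ~69.0 mi/degree
       69.17 * COS(RADIANS(32.91336))         AS MpDLong,  -- ~58.1 mi/degree
       50 / 69.0                              AS Dp50Lat,  -- ~0.725 degrees
       50 / (69.17 * COS(RADIANS(32.91336)))  AS Dp50Long; -- ~0.861 degrees
```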

Calculating the distance (for those locations not readily filtered out)
Depending on the level of precision required, it may be acceptable to stick with the MpD factors (I'm guessing errors of less than a mile or so, for distances on the order of 50 miles, within the continental USA). The distance would then be calculated as Sqrt(((z.latitude - 32.91336) * MpDLat)^2 + ((z.longitude + 85.93836) * MpDLong)^2) or, if we are only interested in filtering rows out without needing the distance per se, we can work with the squares, i.e. ... WHERE ((z.latitude - 32.91336) * MpDLat)^2 + ((z.longitude + 85.93836) * MpDLong)^2 < 2500 -- 2500 is 50^2
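Putting the pruning and the squared-distance filter together, the inner query might look something like the following sketch (table and column names are from the question; the 0.73/0.87-degree bounds and the 69.0/58.1 mile-per-degree factors are the precomputed values for this particular reference point, padded slightly to allow for underestimation):

```sql
SELECT b.ID
FROM zipcode2business.accountants b
JOIN zipcodeworld.storelocator_us z ON b.ZIPCODE = z.ZIP_CODE
WHERE b.STATE = 'AL'
  -- cheap bounding-box prune; BETWEEN on the raw columns may also let
  -- MySQL use the LAT_LONG index seen in the EXPLAIN output
  AND z.latitude  BETWEEN 32.91336 - 0.73 AND 32.91336 + 0.73
  AND z.longitude BETWEEN -85.93836 - 0.87 AND -85.93836 + 0.87
  -- planar approximation on the survivors, compared against 50^2 = 2500
  AND POW((z.latitude  - 32.91336) * 69.0, 2)
    + POW((z.longitude + 85.93836) * 58.1, 2) < 2500;
```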

I'm guessing this type of approximation is acceptable, since much bigger errors come from the fact that the distance by road (which is likely the one eventually desired) rarely matches the 'as-the-crow-flies' distance ;-) I could calculate the precise worst-case loss of precision (but again, no time for that now...)

If the exact distance is needed, we should use a slightly better formula than the original one, which seems directly derived from the spherical law of cosines. We can probably do better.
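The haversine formula is one such better-behaved candidate, since it avoids the loss of precision that ACOS suffers for very small angles. A sketch in MySQL syntax (table and column names are taken from the question; 3959 is the Earth's radius in miles, as in the original query):

```sql
SELECT z.ZIP_CODE,
       2 * 3959 * ASIN(SQRT(
           POW(SIN(RADIANS(z.latitude - 32.91336) / 2), 2)
         + COS(RADIANS(32.91336)) * COS(RADIANS(z.latitude))
           * POW(SIN(RADIANS(z.longitude - (-85.93836)) / 2), 2)
       )) AS distance
FROM zipcodeworld.storelocator_us z;
```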

Variations on the above
The ideas discussed above can be implemented in various fashions, for example with the use of temporary SQL tables, with different constructs for the query (or queries), etc.

mjv
A: 

You could select the distance calculation into a temp table and remove the HAVING from your SQL, then do a second SELECT with WHERE dist <= 50.

This helps save memory and avoids possible swapping out to temporary disk segments when there are large numbers of records in your base table.
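A sketch of that two-step approach (the temp table name is made up; RADIANS() is used in place of the 3.142/180 approximation from the question):

```sql
-- Step 1: materialize the distances once, without any HAVING filter.
CREATE TEMPORARY TABLE tmp_distances AS
SELECT b.ID,
       ROUND(ACOS(SIN(RADIANS(32.91336)) * SIN(RADIANS(z.latitude))
             + COS(RADIANS(32.91336)) * COS(RADIANS(z.latitude))
             * COS(RADIANS(z.longitude) - RADIANS(-85.93836))) * 3959, 2) AS dist
FROM zipcode2business.accountants b
JOIN zipcodeworld.storelocator_us z ON b.ZIPCODE = z.ZIP_CODE
WHERE b.STATE = 'AL';

-- Step 2: count the rows within the radius.
SELECT COUNT(*) FROM tmp_distances WHERE dist <= 50;
```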

MikeD