views:

127

answers:

5

Quest

The query selects all the points beginning with "Vancouver" and are within a 5 minute area from the center of all locations beginning with "Vancouver". For example, Vancouver South Fraser, Vancouver Fairview, and Vancouver Ballantree Place W have latitudes and longitudes within 5 minutes of their average latitude and longitude. The latitudes and longitudes are stored as (4915, 12311) integer pairs (meaning 49.15'N and 123.11'W).

SQL Code

The following SQL abomination does the trick:

SELECT
  NAME
FROM
 STATION
WHERE
      DISTRICT_ID = '110'
  AND NAME LIKE 'Vancouver%'
  AND LATITUDE BETWEEN
    (SELECT round((min(LATITUDE) + max(LATITUDE)) / 2)-5 FROM STATION WHERE DISTRICT_ID = '110' AND NAME LIKE 'Vancouver%')
    and
    (SELECT round((min(LATITUDE) + max(LATITUDE)) / 2)+5 FROM STATION WHERE DISTRICT_ID = '110' AND NAME LIKE 'Vancouver%')
  AND LONGITUDE BETWEEN
    (SELECT round((min(LONGITUDE) + max(LONGITUDE)) / 2)-5 FROM STATION WHERE DISTRICT_ID = '110' AND NAME LIKE 'Vancouver%')
    and
    (SELECT round((min(LONGITUDE) + max(LONGITUDE)) / 2)+5 FROM STATION WHERE DISTRICT_ID = '110' AND NAME LIKE 'Vancouver%')
ORDER BY
  LATITUDE

Question

How can this query be simplified to remove the redundancy, without using a view?

Restrictions

The database is MySQL, but ANSI SQL is always nice.

Thank you!

+1  A: 

Use a common table expression ...

with cte as
 (  SELECT round((min(LATITUDE) + max(LATITUDE)) / 2)-5 min_lat
           , round((min(LATITUDE) + max(LATITUDE)) / 2)+5 max_lat
           , round((min(LONGITUDE) + max(LONGITUDE)) / 2)-5 min_long
           , round((min(LONGITUDE) + max(LONGITUDE)) / 2)+5 max_long
           , DISTRICT_ID
           ,  'Vancouver%' AS NAME 
    FROM STATION 
    WHERE DISTRICT_ID = '110' 
AND NAME LIKE 'Vancouver%'
group by DISTRICT_ID,  'Vancouver%')
SELECT
  NAME
FROM
 STATION , cte
WHERE
      station.DISTRICT_ID = cte.DISTRICT_ID
  AND station.NAME LIKE cte.NAME
  AND station.LATITUDE BETWEEN cte.min_lat AND cte.max_lat
   AND station.LONGITUDE BETWEEN cte.min_long AND cte.max_long
ORDER BY
  station.LATITUDE

NB: I don't have access to a database right now, so I haven't been able to test this. Consequently I can't guarantee that it is works. I will test it when I can. The principle holds.

APC
I never remember common table expressions. Thank you.
Dave Jarvis
@APC - AFAIK, MySql does not support CTEs.
Thomas
@Thomas - sorry, I'm not really a MySQL chap but Dave did ask for a ANSI SQL solution :) Anyway, you are right: MySQL does not currently support the WITH clause, although the last message in this Bug report suggests it might be on the horizon http://bugs.mysql.com/bug.php?id=16244
APC
A: 

I'm presuming that the original query provided an accurate result to your purpose. If that is the case, then you can consolidate the query by putting the calculation of the end points into a subquery.

Select ...
From Station As S
    Cross Join  (
                Select Round( (Min(S1.Latitude)  + Max(S1.Latitude)) / 2 ) As Latitude
                    , Round( (Min(S1.Longitude)  + Max(S1.Longitude)) / 2 ) As Longitude
                From Station As S1
                Where S1.District_Id = '110'
                    And S1.Name Like 'Vancouver%'
                ) As S2
Where S.District_Id = '110'
    And S.Name Like 'Vancouver%'
    And  S.Latitude Between (S2.Latitude - 5) And (S2.Latitude + 5)
    And  S.Longitude Between (S2.Longitude - 5) And (S2.Longitude + 5)
Order By S.Latitude
Thomas
+2  A: 
select 
  name
from 
  (select 
    round((min(LATITUDE) + max(LATITUDE)) / 2) as LATITUDE,
    round((min(LONGITUDE) + max(LONGITUDE)) / 2) as LONGITUDE
   from STATION 
   where DISTRICT_ID = '110' 
     AND NAME LIKE 'Vancouver%') AS center
  inner join STATION s
where
  s.DISTRICT_ID = '110' 
  and s.NAME like 'Vancouver%'
  and s.LATITUDE between center.LATITUDE - 5 and center.LATITUDE + 5
  and s.LONGITUDE between center.LONGITUDE - 5 and center.LONGITUDE + 5
Senseful
This is exactly what I was trying to wrap my brain around. Thank you.
Dave Jarvis
Hm, ok, but not optimal - you can avoid the join if you rewrite you query as ABS(latitude-(SELECT avg(latitude)) < 10 (and similar for longitude). Also, notice that that will not give results within radius, but within rectangle (as stated in my answer).
Unreason
@eagle: -1, as per Unreason you're finding stuff in a projected square, not a circle, but you also don't solve the problem of the intger representation of the longitude/lattitide, and your calculation of the centre is wrong - See Unreason's answer elsewhere
symcbean
+2  A: 

First of all do notice that your definition 'within 5 minutes of each other' does not define a single solution and that your (MIN()+MAX())/2 is not average but simply middle of min and max. You might be looking for AVG() in your subqueries.

Secondly you are not getting results within 5 secs from each other, but entries whose longitude and latitude are at most 10 secs (which on diagonal can be closer to 14).

In mysql you can use session variables such as:

SET @avg_lat := (SELECT round(avg(LATITUDE)) FROM STATION WHERE DISTRICT_ID = '110' AND NAME LIKE 'Vancouver%');
SET @avg_long := (SELECT round(avg(LONGITUDE)) FROM STATION WHERE DISTRICT_ID = '110' AND NAME LIKE 'Vancouver%');

SELECT
  NAME
FROM
 STATION
WHERE
  DISTRICT_ID = '110'
  AND pow(LATITUDE-@avg_lat,2)+pow(LONGITUDE-@avg_long,2)<25
ORDER BY
  LATITUDE

Even though it is not necessary (as in the query written like above both variables occur only once).

EDIT: Oops, missread the question. It is the radius of center - so replace 25 with 100 (and while at it decide if you want to use less or equal). Also, if center is the center of bounding box then your (min()+max())/2 is right formula and not my suggestion. Still 'center of all locations' is a bit vague so I am leaving my answer (it is easy to change it).

EDIT2: Just noticed that the units in the my query are not right, if the latitude is stored in centiminutes, then the comparison should be to centiminutes too (10*100)^2=1000000

And finally, your decision to stick with (min()+max())/2 will lead to instances where you might have a single row which is far toward max and min which could make the query miss any of the results (and can happen, usually all the locations with similar names are next to each other, but it is not uncommon to have another location starting with the same name which is an isolated place far from the conglomerate of locations)

As for 5 minute area, to be completely precise better say it is 10x10 minute area, that's what the queries would return.

EDIT3: The formula used above for the distance is not very precise if you move away from the equator. Here is a better approximations of distance formula For serious work you might need something like this

Unreason
Thanks for this. I might have to come back and re-examine the requirements.
Dave Jarvis
This is a great answer and will ultimately be what I need. I have decided to use a different data set to get the lat/long of the city and then use this code to find all stations within a given radius.
Dave Jarvis
Ok, the final part is actually this: http://dev.mysql.com/doc/refman/5.1/en/spatial-extensions.html, if your application does spend a lot of time dealing with spatial data look into it (caveat, I have not used spatial extension on mysql, only on postgres and oracle, but I know these are relatively new features, so watch your steps; still the difference in performance between spatially indexed data and regularly indexed data can easily reach 5-6 orders of magnitude)
Unreason
I realized, as you hinted, that the general idea was flawed. Turns out that MaxMind.com released their city and lat/long database. I have since integrated it; users select the city name, then the SQL matches the selected city's location against station locations. It is a clean solution. Thanks again for all your help!
Dave Jarvis
A: 

Whatever happened to good old Pythagoras (OK I know it doesn't really apply to curved surfaces - but should be a good enough approximation). If you are looking for the centre (actually, the centroid based on the interpretation applied by physicist rather than geometrists) of the set of coordinate pairs, then you should not be using MIN and MAX, although you might consider bounding the search based on the MIN and MAX). The only remaining fly in the ointment is that you store an integer representation of a string representation of a coordinate angle.

Consider:

SELECT b.name
FROM
(SELECT AVG(CALC(a.lattitude)) AS c_lat, AVG(CALC(a.longitude)) AS c_long
  FROM station a 
  WHERE a.district_id='110'
  AND a.name like 'VANCOUVER%'
) AS ilv,
station b
WHERE b.district_id='110'
AND b.name LIKE 'VANCOUVER%'
AND POW(ilv.c_lat-CALC(b.lattitude),2)
     + POW(olv.c_long-CALC(b.longitude),2)<=25;

Where the CALC function converts the stored value into a longitude/lattitude in minutes, i.e.

CALC(x)=(FLOOR(x/100)*60+MOD(x,100))

C.

symcbean
I wouldn't say you need CALC, the values are given in centiminutes, just divide them with 100. Otherwise I guess examples would say 49 degrees 15 minutes and not 49 point 15 minutes.
Unreason
Yes - you're right - I'm just used to seeing coordinates being in degrees/minutes/seconds
symcbean