views:

1567

answers:

5

I have a mysql (5.0.22) myisam table with roughly 300k records in it and I want to do a lat/lon distance search within a five mile radius.

I have an index that covers the lat/lon fields and is fast (milisecond response) when I just select for lat/lon. But when I select for additional fields in the table is slows down horribly to 5-8 seconds.

I'm using myisam to take advantage of fulltext search. The other indexes perform well (e.g. select * from Listing where slug = 'xxxxx').

How can I optimize my query, table or index to speed things up?

My schema is:

CREATE TABLE  `Listing` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `name` varchar(125) collate utf8_unicode_ci default NULL,
  `phone` varchar(18) collate utf8_unicode_ci default NULL,
  `fax` varchar(18) collate utf8_unicode_ci default NULL,
  `email` varchar(55) collate utf8_unicode_ci default NULL,
  `photourl` varchar(55) collate utf8_unicode_ci default NULL,
  `thumburl` varchar(5) collate utf8_unicode_ci default NULL,
  `website` varchar(85) collate utf8_unicode_ci default NULL,
  `categoryid` int(10) unsigned default NULL,
  `addressid` int(10) unsigned default NULL,
  `deleted` tinyint(1) default NULL,
  `status` int(10) unsigned default '2',
  `parentid` int(10) unsigned default NULL,
  `organizationid` int(10) unsigned default NULL,
  `listinginfoid` int(10) unsigned default NULL,
  `createuserid` int(10) unsigned default NULL,
  `createdate` datetime default NULL,
  `lasteditdate` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  `lastedituserid` int(10) unsigned default NULL,
  `slug` varchar(155) collate utf8_unicode_ci default NULL,
  `aclid` int(10) unsigned default NULL,
  `alt_address` varchar(80) collate utf8_unicode_ci default NULL,
  `alt_website` varchar(80) collate utf8_unicode_ci default NULL,
  `lat` decimal(10,7) default NULL,
  `lon` decimal(10,7) default NULL,
  `city` varchar(80) collate utf8_unicode_ci default NULL,
  `state` varchar(10) collate utf8_unicode_ci default NULL,
  PRIMARY KEY  (`id`),
  KEY `idx_fetch` USING BTREE (`slug`,`deleted`),
  KEY `idx_loc` (`state`,`city`),
  KEY `idx_org` (`organizationid`,`status`,`deleted`),
  KEY `idx_geo_latlon` USING BTREE (`status`,`lat`,`lon`),
  FULLTEXT KEY `idx_name` (`name`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci ROW_FORMAT=DYNAMIC;

My query is:

SELECT Listing.name, Listing.categoryid, Listing.lat, Listing.lon
, 3956 * 2 * ASIN(SQRT( POWER(SIN((Listing.lat - 37.369195) * pi()/180 / 2), 2) + COS(Listing.lat * pi()/180) * COS(37.369195 * pi()/180) * POWER(SIN((Listing.lon --122.036849) * pi()/180 / 2), 2) )) rawgeosearchdistance
FROM Listing
WHERE
    Listing.status = '2'
    AND ( Listing.lon between -122.10913433498 and -121.96456366502 )
    AND ( Listing.lat between 37.296909665016 and 37.441480334984)
HAVING rawgeosearchdistance < 5
ORDER BY rawgeosearchdistance ASC;

Explain plan without geosearch:

    +----+-------------+------------+-------+-----------------+-----------------+---------+------+------+-------------+
    | id | select_type | table      | type  | possible_keys   | key             | key_len |ref | rows | Extra       |
    +----+-------------+------------+-------+-----------------+-----------------+---------+------+------+-------------+
    |  1 | SIMPLE      | Listing    | range | idx_geo_latlon  | idx_geo_latlon  | 19      | NULL |  453 | Using where |
    +----+-------------+------------+-------+-----------------+-----------------+---------+------+------+-------------+

Explain plan with geosearch:

+----+-------------+------------+-------+-----------------+-----------------+---------+------+------+-----------------------------+
| id | select_type | table      | type  | possible_keys   | key             | key_len | ref  | rows | Extra                       |
+----+-------------+------------+-------+-----------------+-----------------+---------+------+------+-----------------------------+
|  1 | SIMPLE      | Listing    | range | idx_geo_latlon  | idx_geo_latlon  | 19      | NULL |  453 | Using where; Using filesort |
+----+-------------+------------+-------+-----------------+-----------------+---------+------+------+-----------------------------+

Here's the explain plan with the covering index. Having the columns in the correct order made a big difference:

+----+-------------+--------+-------+---------------+---------------+---------+------+--------+------------------------------------------+
| id | select_type | table  | type  | possible_keys | key           | key_len | ref  | rows   | Extra                                    |
+----+-------------+--------+-------+---------------+---------------+---------+------+--------+------------------------------------------+
|  1 | SIMPLE      | Listing | range | idx_geo_cover | idx_geo_cover | 12      | NULL | 453     | Using where; Using index; Using filesort |
+----+-------------+--------+-------+---------------+---------------+---------+------+--------+------------------------------------------+

Thank you!

A: 

You really should avoid doing that much math in your select statement. That's probably the source of a lot of your slowdowns. Remember, SQL is a query language; it's really not optimized for trigonometric functions.

SQL will be faster and your overall results will be faster if you do a very naive distance search (which will return more results) and then winnow your results.

If you want to be using distance in your query, at the very least, use a squared distance calculation; sqrt calculations are notoriously slow. Squared distance is much easier to use. A squared distance calculation is simply using the square of the distance instead of the distance; it is much simpler. For cartesian coordinate systems, since the sum of the squares of the short sides of a right triangle equals the square of the hypotenuse, it's easier to calculate the square distance (just sum the two squares) than it is to calculate the distance; all you have to do is make sure that you're squaring the distance you want to compare to (so instead of finding the precise distance and comparing that to your desired distance (let's say 5), you find the square distance, and compare that to the square of the desired distance (25, if your desired distance was 5).

McWafflestix
Do you know where I can find more information on the squared distance calculation?
Jeff
@Jeff: I will add it to the answer.
McWafflestix
Here's a simple suggestion: define your great-circle distance calculation as a stored function, and rework your query so it's the third AND clause after your lat and long. Do your lat query first, then your long query. A nautical mile north or south is approximately one degree of latitude all over the world, but the distance of a degree of longitude varies by latitude.
Ollie Jones
+1  A: 

You are probably using a 'covering index' in your lat/lon only query. A covering index occurs when the index used by the query contains the data that you are selecting for. MySQL only needs to visit the index and never the data rows. See this for more info. That would explain why the lat/lon query is so fast.

I suspect that the calculations and the sheer number of rows returned, slows down the longer query. (plus any temp table that has to be created for the having clause).

jonstjohn
You're close to the problem relating to a covering index. The index I had didn't cover enough columns. I expanded it to cover all the columns needed and it was faster, but still takes 1 - 1.7s.I also had to reduce the charset to latin1 if I wanted to include information like phone, email, website, etc. (1 byte vs 3 bytes for utf8)
Jeff
A: 

Depending on the number of your listings could you create a view that contains

Listing1Id, Listing2ID, Distance

Basically just have all of the distances "pre-calculated"

Then you could do something like:

Select listing2ID from v_Distance d where distance < 5 and listing1ID = XXX

+3  A: 

I think you really should consider the use of PostgreSQL (combined with Postgis).

I have given up on MySQL for geospatial data (for now) because of the following reasons:

  • MySQL only supports spatial datatypes / spatial indexes on MyISAM tables with the inherent disadvantages of MyISAM (concerning transactions, referential integrity...)
  • MySQL implements some of the OpenGIS specifications only on a MBR-basis (minimum bounding rectangle) which is pretty useless for most serious geospatial querying-processing (see this link in the MySQL manual). Chances are you will need some of this functionality sooner of later.

PostgreSQL/Postgis with proper (GIST) spatial indexes and proper queries can be extremely fast.

Example: determining overlapping polygons between a 'small' selection of polygons and a table with over 5 million (!) very complex polygons, calculate the amount of overlap between these results + sort. Average runtime: between 30 and 100 milliseconds (This particular machine has a lot of RAM off course. Don't forget to tune your PostgreSQL install... (read the docs)).

ChristopheD
+1, yes this is a spatial problem, so it asks for a spatial solution, as well.
none
A: 

When I implemented geo radius search I just loaded all of the us Zipcodes into memory with their lat long and then used my starting point with radius to get a list of zipcodes in the radius and then used that for my db query. Of course I was using solr to do my searching because the search space was in the 20 million row range but the same principles should apply. Apologies for the shallowness of this response as I'm on my phone.

Hardwareguy