views:

376

answers:

1

I'm using geokit (acts_as_mappable) in a rails application, and the performance of radial or bounds searches degrades considerably when there is a large number of models (I've tried with 1-2million but the problem no doubt kicks in earlier than this).

Geokit does all its calculations based on lat and lng columns in the table (latitude and longitude). To improve performance geokit will typically add a bounding box 'where' clause, with the intent being to use a combined index on latitude and longitude to improve performance. However it is still incredibly slow with large numbers of models, and it seems to me that the bounding box clause should help a lot more than it does.

So my question is, is there a way to make mysql make better use of the combined lat/lng index or otherwise improve the performance of geokit sql queries? Or, can the combined index for lat/lng be made more helpful?

edit: I've got this working with rails now and written the solution up in more detail here

More Background

For example, this query finds all places within 10 miles of a given point. (I've added .length just to determine how many results come back - there are nicer ways to say this in geokit, but I wanted to force a more typical SQL query).

Place.find(:all,:origin=>latlng,:within=>10).length

It takes about 14s on a mac mini. Here is the explain plan

mysql> explain SELECT *, (ACOS(least(1,COS(0.898529183781244)*COS(-0.0157233221653665)*COS(RADIANS(places.lat))*COS(RADIANS(places.lng))+    ->  COS(0.898529183781244)*SIN(-0.0157233221653665)*COS(RADIANS(places.lat))*SIN(RADIANS(places.lng))+    ->  SIN(0.898529183781244)*SIN(RADIANS(places.lat))))*3963.19)
    ->  AS distance FROM `places` WHERE (((places.lat>51.3373601471464 AND places.lat<51.6264998528536 AND places.lng>-1.13302245886176 AND places.lng<-0.668737541138245)) AND ( (ACOS(least(1,COS(0.898529183781244)*COS(-0.0157233221653665)*COS(RADIANS(places.lat))*COS(RADIANS(places.lng))+
    ->  COS(0.898529183781244)*SIN(-0.0157233221653665)*COS(RADIANS(places.lat))*SIN(RADIANS(places.lng))+
    ->  SIN(0.898529183781244)*SIN(RADIANS(places.lat))))*3963.19)
    ->  <= 10)) 
    -> ;
+----+-------------+--------+-------+-----------------------------+-----------------------------+---------+------+-------+----------+-------------+
| id | select_type | table  | type  | possible_keys               | key                         | key_len | ref  | rows  | filtered | Extra       |
+----+-------------+--------+-------+-----------------------------+-----------------------------+---------+------+-------+----------+-------------+
|  1 | SIMPLE      | places | range | index_places_on_lat_and_lng | index_places_on_lat_and_lng | 10      | NULL | 87554 |   100.00 | Using where | 
+----+-------------+--------+-------+-----------------------------+-----------------------------+---------+------+-------+----------+-------------+

So mysql is examining 87554 rows even though the number of places in the result is 1135 (and the number of places actually in the bounding box is just 1323).

These are the stats on the index (which is made with a rails migration *add_index :places, [:lat, :lng]*):

| Table  | Non_unique | Key_name                         | Seq_in_index | Column_name      | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
| places |          1 | index_places_on_lat_and_lng      |            2 | lng              | A         |     1373712 |     NULL | NULL   | YES  | BTREE      |         |

Nor does it seem to be related to the trig calculations, as doing a similar query for a bounding box results in a much simpler query but it performs similarly badly:

Place.find(:all,:bounds=>GeoKit::Bounds.from_point_and_radius(latlng,10)).length

Gives a similar explain plan:

   mysql> explain SELECT * FROM `places` WHERE ((places.lat>51.3373601471464 AND places.lat<51.6264998528536 AND places.lng>-1.13302245886176 AND places.lng<-0.668737541138245)) ;
    +----+-------------+--------+-------+-----------------------------+-----------------------------+---------+------+-------+----------+-------------+
    | id | select_type | table  | type  | possible_keys               | key                         | key_len | ref  | rows  | filtered | Extra       |
    +----+-------------+--------+-------+-----------------------------+-----------------------------+---------+------+-------+----------+-------------+
    |  1 | SIMPLE      | places | range | index_places_on_lat_and_lng | index_places_on_lat_and_lng | 10      | NULL | 87554 |   100.00 | Using where | 
    +----+-------------+--------+-------+-----------------------------+-----------------------------+---------+------+-------+----------+-------------+
+2  A: 

Plain B-Tree indexes are not too good for the queries like this.

For your query, the range access method is used on the following condition:

places.lat > 51.3373601471464 AND places.lat < 51.6264998528536

, this doesn't even take lon into account.

If you want to use spatial abilities, you should keep your places as Points, create a SPATIAL index of them and use MBRContains to filter the bounding box:

ALTER TABLE places ADD place_point GEOMETRY

CREATE SPATIAL INDEX sx_places_points ON places (place_point)

UPDATE  places
SET     place_point = Point(lat, lon)

SELECT  *
FROM    places
WHERE   MBRContains(LineString(Point(51.3373, -1.1330), Point(51.6264, -0.6687)), place_point)
        AND -- do the fine filtering here

Update:

CREATE TABLE t_spatial (id INT NOT NULL, lat FLOAT NOT NULL, lon FLOAT NOT NULL, coord GEOMETRY) ENGINE=MyISAM;

INSERT
INTO    t_spatial (id, lat, lon)
VALUES  (1, 52.2532, 20.9778);

UPDATE  t_spatial
SET     coord = Point(lat, lon);

This works for me in 5.1.35.

Quassnoi
That's interesting - what kind of index should there be in that case?
frankodwyer
thanks - that sounds like it will work a lot better and I will try it out. is there also a way to improve the current query without using spatial (as geokit currently doesn't use mysql spatial stuff)?
frankodwyer
interestingly if I run this query SELECT * FROM `places` WHERE ((places.lat>51.3373601471464 AND places.lat<51.6264998528536)); it only returns 42078 rows! So it looks like mysql isn't making a great job of that part either.
frankodwyer
@frankodwyer: without rewriting the query not much can be done. If `MySQL` returns `42078` rows, that just means most your places are located between `51.3373` and `51.6264` (which covers the whole city of London), so you can hardly blame `MySQL` for that, it just returns back what's being put :)
Quassnoi
Yes I wasn't objecting to it returning 40K rows, but the fact that it though it should examine 80K of them to do it. Also in the bounding box case that is more usual, there should only be 1K or so results even in this dataset.
frankodwyer
`@frankodwyer`: the number of rows you see in the `EXPLAIN` is only an estimate, calculated from `MIN`, `MAX` and cardinality. In fact, `MySQL` examines only as many rows as you range query returns.
Quassnoi
Ah OK, makes more sense. It still seems strange to me that it cannot be persuaded to optimise using both lat and lng in the original query. In the case of bbox I don't see what's unusual about it being spatial - it seems like a general issue that would apply to any query with two ranges (e.g. age and height in a DB of the general population)?
frankodwyer
`@frankodwyer`: it is a general issue (range filtering on two columns) indeed. A `B-Tree` index cannot cope with it by design. It's a well-known problem: how to find all networks an IP address belongs to, how to find all date ranges that contain a certain point in time, etc. That's exactly what `R-Tree` (`SPATIAL`) indexes are for. See these articles in my blog for similar problems: http://explainextended.com/2009/04/04/banning-ips/, http://explainextended.com/2009/07/01/overlapping-ranges-mysql/
Quassnoi
Ah!! I see...great blog by the way. I've been trying to the spatial suggestion but stuck on getting the following error: UPDATE places SET place_point=Point(lat,lng);ERROR 1416 (22003): Cannot get geometry object from data you send to the GEOMETRY field. I've also tried 'alter table places add place_point geometry not null' as I think this is required for a spatial index. Any idea what is the cause of this message?
frankodwyer
more info on that - both lat and lng are not null also. altho that wasn't enforced by the DB, I also added the constraint and it makes no difference.
frankodwyer
`@frankodwyer`: could you please add `ORDER BY id LIMIT 10` to your update query and isolate this very row which fails by changing `LIMIT`?
Quassnoi
it fails on the very first row, which has lat=52.2532 and lng=20.9778
frankodwyer
I updated the post with a script that works for me. Could you please post your table definition? Just issue `SHOW CREATE TABLE places` and post the output.
Quassnoi
Looks like a version thing - your script doesn't work for me on 5.0.51a or 5.1.35, but it did work when I installed 5.1.37 on my test box. I will try some performance tests now on 5.1.37, then try to figure out what the problem is on the older versions.
frankodwyer
The performance is massively improved on 5.1.37 using this approach - execution times of less than a second. Also, mysql doesn't seem to object to this for the older sql versions: update t_spatial set coord=POINTFROMTEXT(CONCAT('POINT(',lat,' ',lon,')'));, and create spatial index then works on the table also.
frankodwyer