views:

939

answers:

7

I have a MySQL database. I store homes in the database and perform literally just 1 query against the database, but I need this query to be performed super fast, and that's to return all homes within a square box geo latitude & longitude.

SELECT * FROM homes 
WHERE geolat BETWEEN ??? AND ???
AND geolng BETWEEN ??? AND ???

How is the best way for me to store my geo data so that I can perform this query of displaying all home within the geolocation box the quickest?

Basically:

  • Am I using the best SQL statement to perform this query the quickest?
  • Does any other method exist, maybe not even using a database, for me to query the fastest way a result of homes within a boxed geolocation bounds?

In case it helps, I've include my database table schema below:

CREATE TABLE IF NOT EXISTS `homes` (
  `home_id` int(10) unsigned NOT NULL auto_increment,
  `address` varchar(128) collate utf8_unicode_ci NOT NULL,
  `city` varchar(64) collate utf8_unicode_ci NOT NULL,
  `state` varchar(2) collate utf8_unicode_ci NOT NULL,
  `zip` mediumint(8) unsigned NOT NULL,
  `price` mediumint(8) unsigned NOT NULL,
  `sqft` smallint(5) unsigned NOT NULL,
  `year_built` smallint(5) unsigned NOT NULL,
  `geolat` decimal(10,6) default NULL,
  `geolng` decimal(10,6) default NULL,
  PRIMARY KEY  (`home_id`),
  KEY `geolat` (`geolat`),
  KEY `geolng` (`geolng`),
) ENGINE=InnoDB  ;

UPDATE

I understand spatial will factor in the curvature of the earth but I'm most interested in returning geo data the FASTEST. Unless these spatial database packages somehow return data faster, please don't recommend spatial extensions. Thanks

UPDATE 2

Please note, no one below has truly answered the question. I'm really looking forward to any assistance I might receive. Thanks in advance.

A: 

If you really need to go for performance you can define bounding boxes for your data and map the pre-compute bounding boxes to your objects on insertion and use them later for queries.

Here is a great article that describes this approach in detail. If the resultsets are reasonably small you could still do accuracy corrections in the application logic (easier to scale horizontal than a database) while enabling to serve accurate results.

There are also links to source code like Bret Slatkin's geobox.py which contains great documentation.

I know the article is google app engine specific but the solutions is applicable to relational databases as well.

I would still recommend checking out PostgreSQL and PostGIS in comparison to MySQL if you intend to do more complex queries in the foreseeable future.

tosh
A: 

The indices you are using are indeed B-tree indices and support the BETWEEN keyword in your query. This means that the optimizer is able to use your indices to find the homes within your "box". It does however not mean that it will always use the indices. If you specify a range that contains too many "hits" the indices will not be used.

Peter Lindqvist
So, would using min_latitude >= ??? max_latitude <= ??? be better instead of using BETWEEN?
HankW
From the manual: This is equivalent to the expression (min <= expr AND expr <= max)
Peter Lindqvist
what do you mean if there are too many "hits" the indexes won't be used? I don't understand
HankW
If you specify an area that contains too many records the index wont be used.
Peter Lindqvist
A: 

This looks pretty fast. My only concern would be that it would use an index to get all the values within 3 miles of the latitude, then filter those for values within 3 miles of the longitude. If I understand how the underlying system works, you can only use one INDEX per table, so either the index on lat or long is worthless.

If you had a large amount of data, it might speed things up to give every 1x1 mile square a unique logical ID, and then make an additional restriction on the SELECT that (area="23234/34234" OR area="23235/34234" OR ... ) for all the squares around your point, then force the database to use that index rather than the lat and long. Then you'll only be filtering much less square miles of data.

Christopher Gutteridge
One index per table? Do you confuse it with primary key?
Peter Lindqvist
I mean that when you do a SELECT it only uses one index per table in the SELECT.
Christopher Gutteridge
Ah.. That's a good point, but do you think creating a composite index would make a difference?
Peter Lindqvist
A (more sophisticated) composite index is what spatial indexes do and it is faster if there is a lot of data.
Mark Thornton
A: 

I had the same problem, and wrote a 3 part blogpost. This was faster than the geo index.

Intro, Benchmark, SQL

Evert
Evert, how do you implemented Morton (z-value)? You're second article just jumps in and says nothing about how you computed the value
HankW
The third one does actually. There's a stored procedure
Evert
What I don't understand is that when I perform the SELECT, how do I know what the morton value is to select on?
HankW
Good question. You should make sure that for every row in your table, you also store this morton value. You can do this with an AFTER INSERT (along with AFTER UPDATE).When you SELECT you can simply do a BETWEEN getGeoMorton(lat1,lng1) AND getGeoMorton(lat2,lng2).Because the morton select will be an approximation, and can include a lot of items outside the area, you must also add a standard where clause for just the latitude and longitude bounding box.The real trick is that you are now using a BTREE for much smaller areas rather than JUST the latitude or longitude.
Evert
A: 

Homes? You probably won't even have ten thousand of them. Just use an in-memory index like STRTree.

novalis
+1  A: 

There is a good paper on MySQL geolocation performance here:

http://www.scribd.com/doc/2569355/Geo-Distance-Search-with-MySQL

EDIT Pretty sure this is using fixed radius. Also I am not 100% certain the algorithm for calculating distance is the most advanced (i.e. it'll "drill" through Earth).

What's significant is that the algorithm is cheap to give you a ball park limit on the number of rows to do proper distance search.

Igor Zevaka
It appears that using the stored procedure on slide #14 is promising but it's unclear to me if that assumes a fixed radius. Do you know if the radius is fixed or not? I want to be able to pass in the box corner (radius)
HankW
I need to be able to pass in as an argument the boxed radius. Do you think I can then use the linked document as such then?
HankW
A: 

Sticking with your current approach there is one change you should make, Rather than indexing geolat and geolong separately you should have a composite index:

KEY `geolat_geolng` (`geolat`, `geolng`),

Currently your query will only be taking advantage of one of the two indexes.

Ben