views:

123

answers:

3

I have a real estate application and a "house" contains the following information:

house:
- house_id 
- address
- city 
- state
- zip
- price
- sqft
- bedrooms
- bathrooms
- geo_latitude
- geo_longitude

I need to perform an EXTREMELY fast (low latency) retrieval of all homes within a geo-coordinate box.

Something like the SQL below (if I were to use a database):

SELECT * from houses 
WHERE latitude IS BETWEEN xxx AND yyy
AND longitude IS BETWEEN www AND zzz

Question: What would be the quickest way for me to store this information so that I can perform the fastest retrieval of data based on latitude & longitude? (e.g. database, NoSQL, memcache, etc)?

A: 

ThereMongoDB supports geospatial indexes, but there are ways to reduce the computation time for things like this. Depending on how your data is arranged, you can place houses in identifiable 'tiles' and then fetch all houses for a given tile and, from that reduced dataset, sort based on distance from whatever coordinates you have.

Depending on how many tiles there are, you can use bitmasks to find houses that may be near or overlap multiple tiles.

Nick Gerakines
+1  A: 

This is a typical query for a Geographical Information System (GIS) application. Many of these are solved by using quad-tree, or similar spatial, indices. The tiling mentioned is how these often end up being implemented.

If an index containing the coordinates could fit into memory and the DBMS had a decent optimiser, then a table scan could provide a cartesian distance from any point of interest with tolerably low overhead. IF this is too slow, then the query could be pre-filtered by comparing each coordinate axis separately before doing the full distance calculation.

Pekka
A: 

I'm going to assume that you're doing lots more reads than writes, and you don't need to have your database distributed across dozens of machines. If so, you should go for a read-optimized database like sqlite (my personal preference) or mysql, and use exactly the SQL query you suggest.

Most (not all) NoSQL databases end up being overly complicated for queries of this sort, since they're better at looking up exact values in their indexes rather than ranges.

It's nice that you're looking for a bounding box instead of cartesian distance; the latter would be harder for a SQL database to optimize (although you could narrow it to a bounding box, then do the slower cartesian distance calculation).

apenwarr