views:

124

answers:

3

Suppose I have a people and their GPS coordinates:

User1, 52.99, -41.0
User2, 91.44, -21.4
User3, 5.12, 24.5
...

My objective is: Given a set of coordinates,

  1. Out of all those users, find the ones within 20 meters. (how to do a SELECT statement like this?)
  2. For each of those users, get the distance.

As you probably guessed, these coordinates will be retrieved from a mobile phone. The phones will update their longitude/latitude every 10 seconds, as well as get that list of users <20 meters. It's dynamic.

I would like the best way to do this so that it can scale.

  • Would you store the coordinates in a database, and update it every 10 seconds? (If you store it in a database...how would you calculate it...)
  • How would you do this so it can scale?

By the way, there is already a formula that can calculate the distance between 2 coordinates http://www.johndcook.com/python_longitude_latitude.html. I just need to know what's the best way to do this technically (Trees, Database? What architecture? More specifically...how would you tie in the long/lat distance formula into the "SELECT" statement?)

+3  A: 
  1. Create a MyISAM table with a column of datatype Point

  2. Create a SPATIAL index on this column

  3. Convert the GPS coords into UTM (grid) coords and store them in your table

  4. Issue this query:

    SELECT  user_id, GLength(LineString(user_point, @mypoint))
    FROM    users
    WHERE   MBRWithin(user_point, LineString(Point(X(@mypoint) - 20, Y(@mypoint - 20)), Point(X(@mypoint) + 20, Y(@mypoint + 20))
            AND GLength(LineString(user_point, @mypoint)) <= 20
    

Note that this query will most probably be run on very volatile data and you will need to do the additional checks on time.

Since MySQL cannot combine SPATIAL indexes, it will be better to use some kind of surface tiling technology:

  1. Split the Earth surface into a number of tiles, say, 1 x 1 " (it's about 30 meters of the meridian and 30 * COS(lon) of the parallel.

  2. Store the data in the CHAR(14) column: 7 digits of the lat + 7 digits on the lon (14 digits at all). Disable key compression on this column.

  3. Create a composite index on (time, tile)

  4. On the client, calculate all possible tiles your mates may be in. For 20 meters distance, this will be at most 9 tiles, unless you are deep at North or South. However, you may change the tiling algorithm to handle these cases.

  5. Issue this query:

    SELECT  *
    FROM    (
            SELECT  tile1
            UNION ALL
            SELECT  tile2
            UNION ALL
            …
            ) tiles
    JOIN    users u
    ON      u.tile  = tiles.tile
            AND u.time >= NOW() 
            AND GLength(LineString(user_point, @mypoint)) <= 20
    

, where tile1 etc are precalculated tiles.

SQL Server implements this algorithm for its spatial indexes (rather than R-Tree that MySQL uses).

Quassnoi
Wow, this is great
TIMEX
Quassnoi...did you write that formula yourself? Or is it pretty standard? (the select where... formula)
TIMEX
@alex: I wrote the formula myself using the standard `MySQL` geomery functions: http://dev.mysql.com/doc/refman/5.0/en/geometry-property-functions.html Note that these functions require a Euclidean plane, hence the requirement to store in `UTM`.
Quassnoi
A: 

Chapter 11 "Database Design Know it All" has some thoughts on how to design such a database.

Anonym
Thanks, I'll check that out if I ever get a user/pass :)
TIMEX
+2  A: 

Well, the naive approach would be to do an O(n) pass over all points, get their distance from the current point, and find the top 20. This is perfectly Ok for small datasets (say <= 500 points), but on larger sets it's going to be quite slow. In SQL, this would be along the lines of:

SELECT point_id, DIST_FORMULA(x, y) as distance
FROM   points
WHERE  distance < 20

To address the inefficiency of the above method, you would have to use some sort of preprocessing step, most likely space partitioning. That can often dramatically improve performance in nearest neighbour type of searches like this. However, in your case, if all the points are updated every 10 seconds, you would have to do an Ω(n) pass to update the position of each point in the space partitioning tree. If you have more than a few queries between each update, it will be useful, otherwise it'll simply be an overhead.

Max Shawabkeh