This is more of a challenge question than something I urgently need, so don't spend all day on it guys.
I built a dating site (long gone) back in 2000 or so, and one of the challenges was calculating the distance between users so we could present your "matches" within an X mile radius. To just state the problem, given the following database schema (roughly):
USER TABLE UserId UserName ZipCode
ZIPCODE TABLE ZipCode Latitude Longitude
With USER and ZIPCODE being joined on USER.ZipCode = ZIPCODE.ZipCode.
What approach would you take to answer the following question: What other users live in Zip Codes that are within X miles of a given user's Zip Code.
We used the 2000 census data, which has tables for zip codes and their approximate lattitude and longitude.
We also used the Haversine Formula to calculate distances between any two points on a sphere... pretty simple math really.
The question, at least for us, being the 19 year old college students we were, really became how to efficiently calculate and/store distances from all members to all other members. One approach (the one we used) would be to import all the data and calculate the distance FROM every zip code TO every other zip code. Then you'd store and index the results. Something like:
SELECT User.UserId
FROM ZipCode AS MyZipCode
INNER JOIN ZipDistance ON MyZipCode.ZipCode = ZipDistance.MyZipCode
INNER JOIN ZipCode AS TheirZipCode ON ZipDistance.OtherZipCode = TheirZipCode.ZipCode
INNER JOIN User AS User ON TheirZipCode.ZipCode = User.ZipCode
WHERE ( MyZipCode.ZipCode = 75044 )
AND ( ZipDistance.Distance < 50 )
The problem, of course, is that the ZipDistance table is going to have a LOT of rows in it. It isn't completely unworkable, but it is really big. Also it requires complete pre-work on the whole data set, which is also not unmanageable, but not necessarily desireable.
Anyway, I was wondering what approach some of you gurus might take on something like this. Also, I think this is a common issue programmers have to tackle from time to time, especially if you consider problems that are just algorithmically similar. I'm interested in a thorough solution that includes at least HINTS on all the pieces to do this really quickly end efficiently. Thanks!