tags:

views:

57

answers:

2

I've got a postgres 8.3 database of hotels, each with an associated longitude and latitude stored as a point, and a resort stored as a resort id. I'd like to find the central or average point of the resort.

I can do this using a simple query:

select 
avg(lat_long[0]) as latitude, 
avg(lat_long[1]) as longitude,
resort_id
from accomm 
group by resort_id

However, there is some bad data in the database, for example there might be an American hotel that is recorded in a European resort. Obviously doing a simple average this data will mean the results are inaccurate.

How can I calculate an interquartile mean, or similar method to filter out this bad data? I've currently got about 30,000 rows in my table.

+1  A: 

Are all of your hotels in the United States? It seems to me that it might be easier to create a bounding box and just disregard any lat/long combos that are outside of this range.

The biggest drawback of this is that it's not super precise. Basically you can exclude locations in Europe but something on the US/Canada border will probably not get excluded...

Abe Miessler
The hotels are all over the world. We've got something like 2,000 resorts, and each of them would need their own bounding boxes. We could extend our import procedure with a bounding box feature (i.e. when it sees a new hotel, check that it's coordinates are similar to existing hotels in the same resort). The biggest problems I can see with that is it doesn't help our existing data (although we can fix that), and there's a chicken and egg problem for new resorts - the first hotel added might be in the wrong continenent, and then we'd lose location data for all other hotels in that resort!
thelem
Do you have address information for each hotel? I had a similar issue once upon a time and was able to fix it by running a one time only script that passed each address to a mapping api (google, yahoo, microsoft, whatever) and then updated the lat longs if they were too far off. This would help your existing data. The bounding box idea doesn't seem realistic unless all your locations were in one country, so this might be a better way to go...
Abe Miessler
A: 

I would recommend splitting your map in grid of reasonable sized squares(like google do in google maps, you need zoom level around 7-9, I think) and calculate grid position for every point(you will get simple integer X and Y position). You can get number of points in every square and ignore squares with 1-2 points. Or you can use center of square with most points as test center and calculate distance to every point in group(this will be very fast in grid's XY coordinates) ignoring points which are not within reasonable distance from it.

Riz