views:

89

answers:

1

I am in the process of extracting location entities ('Madison Square Garden', 'San Diego Zoo', etc.) from a large table of non-uniform location. I'm trying to avoid multiple entities in my new table. I don't have reliable street addresses, but am trying to extract addresses within a local range.

I am doing this by counting and matching across similar names within a geo range of 30km.

I am going through a geocoded cities table, and for each city creating a lat/long range and then extracting the location names within that range.

Using Madison Square Gardens as an example. The venue would be overlapped multiple times from different cities in the cities table. For instance, Manhattan, Newark, etc. would all likely overlap the NYC range and result in multiple entries for some properties in NYC.

I can't just set the range to a smaller area as I would be missing locations, or worse, ending up with more multiple entries as I don't often have good street/city/geocoded locations (though a 30km area seems to work pretty well).

I was hoping that geohashing would enable me to create a bounding box where I would be able to find the overlap, and avoid multiple entries, but that doesn't seem to be the case based on what I've found.

Ideally, I believe I am looking for a hash or some other value I can put into a cell, and store a unique location name/hash so mysql doesn't end up with multiple entries for a single location.

I don't have proper street address for each entry (though I do have some), so getting street level accuracy and not using a range doesn't really seem to be an option. I also don't have consistent city names, so I end up with NYC, New York, Manhattan, etc. etc. But my cities table is better, including New York, but not NYC.

_--------------EDITED-------------------- After taking another look at Geohash.org, I think I am on the right track, just missing a bit of granularity. If I take the first two characters of the hash, that is consistent within a range that is just too large. For instance, everything from victoria, bc to portland, or is all 'C2' for the first two characters. However, going to a 3 character solution, NYC starts with 'dr5' and so does Newark (which is good), but Hackensack is 'dr7'. Worse, Vanouver, bc. is 'C2b', but the suburb of Richmond bc is 'C28'.

A: 

I've built a similar service that takes a lat/lon coordinate and returns the nearest city. I used a database from http://www.globixdata.com/ which contained the city names and their lat/lon coordinates as well as other data that may be relevant such as approximate population. You may not get 100% accurate results if your address or point is on the border between cities.

I paid for the database but you may be able to find something similar for free online with a bit of sleuthing.

John Kramlich