views: 80
answers: 3

What kind of work has been done to determine whether a specific string pertains to a geographical location? For example:

'troy, ny'
'austin, texas'
'hotels in las vegas, nv'

What I'm expecting is a statistical approach that gives a degree of confidence that the first two are locations. The last one would probably require a heuristic which grabs "%s, %s" and then applies the same technique. I'm specifically looking for approaches that don't rely too heavily on the preposition 'in', since it is neither an unambiguous nor a consistently available indicator of location.

Can anyone point me to approaches, papers, or existing utilities? Thanks!

+1  A: 

The geonames.org search service may help. It returns the names found for the search term as an XML or JSON document.

Example: http://ws.geonames.org/search?q=troy,%20ny&maxRows=10
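A minimal Python sketch of building that request, in case it helps. The base URL and the `q` and `maxRows` parameters come from the example above; the function names are made up for illustration, and the service may nowadays live on a different host or need extra parameters such as an account name, so treat the details as assumptions.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL copied from the example above; it may have changed.
BASE_URL = "http://ws.geonames.org/search"


def build_search_url(term, max_rows=10, base_url=BASE_URL, **extra):
    """Build a search URL for a free-text term (hypothetical helper)."""
    params = {"q": term, "maxRows": max_rows}
    # Any additional query parameters the service needs can be passed in.
    params.update(extra)
    return base_url + "?" + urlencode(params)


def fetch(url, timeout=10):
    """Return the raw response body as text; parsing is left to the caller."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(build_search_url("troy, ny"))
```

`urlencode` writes the space as `+` and the comma as `%2C` rather than `%20` and a literal comma, which is an equivalent encoding of the same query.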

Pierre
+2  A: 

The problem you describe is often called geographic query parsing or more generally geographic information retrieval.

There was a recent task on this at CLEF 2007 (http://www.uni-hildesheim.de/geoclef/2007/Query-Parsing.htm). The winning team used a rule-based grammar, which is probably close to the kind of approach you want to avoid. Another paper, at WWW 2009, talks about GeoParser: http://www2009.eprints.org/239/.

There are also some papers on Geographic Information Retrieval at CIKM 2007: http://www.geo.unizh.ch/~rsp/gir07/accepted.html

I don't know of any open source software that does this, but it may be bundled into a search engine like Lemur.

ealdent
+1  A: 

Everyblock.com takes a very interesting approach that focuses on how locations are expressed in English: they basically use extensive, fairly sophisticated regular expressions, which are now open source. Their application is designed to scan through news articles, reviews, and various public data feeds and relate them to specific locations. Expressions like "A fire in the building on the North-East corner of 20th and Valencia St. in San Francisco" are geocoded very accurately. You can study the source here. The particular part you probably want is epub/epub/geocoder/base.py and everything around it; a good way in is to start with the SmartGeocoder class and work backwards.
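To give a flavour of the regex-driven idea, here is a toy Python sketch applied to the examples from the question. This is not Everyblock's code: the pattern, the tiny gazetteer sets, the `location_score` function, and the score values are all invented for illustration, and the score is a crude heuristic rather than a real probability.

```python
import re

# Hypothetical toy gazetteer; a real system would load a much larger one.
CITIES = {"troy", "austin", "las vegas"}
REGIONS = {"ny", "new york", "tx", "texas", "nv", "nevada"}

# "<words>, <words>" anchored at the end of the string.
PAIR = re.compile(r"([a-z .'-]+?),\s*([a-z .]+)$")


def location_score(text):
    """Return (score, city, region); score is a rough value between 0 and 1."""
    match = PAIR.search(text.lower().strip())
    if not match:
        return 0.0, None, None
    words = match.group(1).split()
    region = match.group(2).strip()
    # Try the longest run of words ending at the comma first, then shorter
    # ones, so "hotels in las vegas" yields "las vegas" without using "in".
    for start in range(len(words)):
        city = " ".join(words[start:])
        if city in CITIES:
            score = 1.0 if region in REGIONS else 0.5
            return score, city, region
    # No known city: a known region alone is only weak evidence.
    return (0.25 if region in REGIONS else 0.0), None, region


if __name__ == "__main__":
    for query in ("troy, ny", "austin, texas", "hotels in las vegas, nv"):
        print(query, "->", location_score(query))
```

The fixed scores could be replaced by frequencies learned from query logs or gazetteer population counts, which would move this closer to the statistical approach the question asks about.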

bvmou