views:

723

answers:

6

I figure this problem is easier than just a regular spell checker since the list of U.S cities is small compared to all known English words.

Anyhow, here's the problem: I have text files with full of city names; some of which are spelled correctly and some which aren't.

What kind of algorithm can I use to correct all the misspellings of city names?

+7  A: 

Do you actually need to correct the misspellings or just flag them as with a normal spell checker? If the latter, you just need to obtain a list of correct spellings and make sure each name is the same as one in your list.

If you want to actually correct them, you probably want to use the concept of edit distance to compare the similarity of misspelled strings to those in your reference list. Then you can replace the misspelled word with the closest match. You may also want to handle the possibility that the intended city is not in your list.

The Levenshtein distance Wikipedia article is another good resource.

108
I need to correct them.
Esteban Araya
Yes, I think the edit distance idea is the right approach for this. That's the one I headed down on when I started this anyhow.
Esteban Araya
A: 

If the same city name occurs more than once in the file you can use the number of occurrence of each city name and flag the one that appears only once.

Julien Grenier
It is possible that a city appears only once and correctly spelled.
Esteban Araya
... or is misspelled, the same way, more than once.
Brad Gilbert
Esteban and Brad : Of course, you are right but if the only things you've got to validate yourself is the file you have to rely on occurence to calculate the statistic
Julien Grenier
+2  A: 

First load the correct city names into an array, then loop through the city names in your file. Check if the current city name is spelled right by seeing if it's in the array of correct names. If it isn't in the array, try comparing the Soundex or Metaphone value of the misspelled word with the words in the array of correct names to find the correct way of spelling it.

yjerem
A: 

There are lists on the web of commonly misspelled city names (like Pittsburg**h**). Other than that I am with Jeremy. You just gotta find the city names dataset, you might want to try the USGS. Zillow has neighbourhood data you might be able to use that.

vfilby
+2  A: 

The trick is knowing which city the name actually refers to and how that city name is correctly spelt. It's not the same as just checking english words.

What's the real task you're trying to solve? Are you processing address lists? You shouldn't be writing your own tools for that: there is an entire industry devoted to this deceptively simple task. :)

I have to do this for the subscription lists for The Perl Review. I've become quite familiar with the webservices for the various post offices throughout the world. You can often go to a postal service website to get a canonical form of an address. There are geocoding tools that can get you the same data.

brian d foy
You're right, it's deceptively tricky. Now that I've been playing around with it, I've also noticed that people sometimes abbreviate city names. Writing this is a great excercise in DP; I'm certain one can achiever fairly decent results w/o too much effort.
Esteban Araya
A: 

I've done this. The edit distance approach is what I did and it works pretty well, but is too slow to do in real time.

One challenge you'll face is that there are a number of cities that are 1 edit distance away from other city names. You didn't say where the names in the text file came from and that makes a big difference. When In my case it was random people that were typing in the city names for a search and they would occasionally misspell the city they intended, but their misspelling was a real city name. In this case you have to make some guesses about the users intent and one easy way to do this is to consider the state if provided.

jshen