I've got a rather large database of location addresses (500k+) from around the world, though many of the addresses are duplicates or near-duplicates. Whenever a new address is entered, I check whether it is already in the database, and if so, I take the existing lat/long and apply it to the new entry. The reason I don't link to a separate table is that the addresses are not searched as a group, and there are often enough differences between addresses that I want to keep them distinct. If I have a complete match on the address, I apply that lat/long. If not, I fall back to the city level and apply that; if I can't get a match there either, I have a separate process to run.
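
To make the flow concrete, here is roughly what the lookup does today (the table and column names are simplified stand-ins for my actual schema):

```python
import sqlite3

def lookup_lat_long(conn, address, city, country):
    """Return (lat, lng) for a new entry by reusing existing rows, or None."""
    cur = conn.cursor()
    # 1. Exact match on the full address string.
    cur.execute(
        "SELECT lat, lng FROM addresses WHERE full_address = ? LIMIT 1",
        (address,),
    )
    row = cur.fetchone()
    if row:
        return row  # reuse the existing lat/long
    # 2. Fall back to a city-level match.
    cur.execute(
        "SELECT lat, lng FROM addresses WHERE city = ? AND country = ? LIMIT 1",
        (city, country),
    )
    row = cur.fetchone()
    if row:
        return row
    # 3. No match at all: hand off to the separate geocoding process.
    return None
```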

Now that you have the extensive background, the problem: occasionally I end up with a lat/long that is far outside the normal acceptable range of error. Strangely, it is usually just one or two of these lat/longs that fall outside the range, while the rest of the rows carrying the same city name have the correct coordinates.
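
Since the bad coordinates stand out against the rest of a city's rows, something like the following would catch them: group by city and measure each row's distance from the city's median point (the 50 km threshold and helper names are just placeholders):

```python
import math
from statistics import median

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def flag_city_outliers(rows, max_km=50.0):
    """rows: list of (row_id, lat, lng) for one city; returns outlier ids."""
    med_lat = median(lat for _, lat, _ in rows)
    med_lng = median(lng for _, _, lng in rows)
    return [
        rid for rid, lat, lng in rows
        if haversine_km(lat, lng, med_lat, med_lng) > max_km
    ]
```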

How would you recommend cleaning up the data? I've got the geonames database, so in theory I have the correct data. What I'm struggling with is the routine you would run to get this done.
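
For example, I imagine something along these lines: join my addresses against the geonames city coordinates and flag any row that sits too far away (this reuses haversine_km from the sketch above; the geonames table layout and threshold are guesses on my part):

```python
def scrub_against_geonames(conn, max_km=50.0):
    """Return ids of address rows whose lat/long disagrees with geonames."""
    cur = conn.cursor()
    cur.execute(
        """SELECT a.id, a.lat, a.lng, g.lat, g.lng
           FROM addresses a
           JOIN geonames g ON g.city = a.city AND g.country = a.country"""
    )
    return [
        row_id
        for row_id, alat, alng, glat, glng in cur.fetchall()
        if haversine_km(alat, alng, glat, glng) > max_km
    ]  # review or re-geocode these rows
```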

If someone could point me in the direction of some (low-level) data-scrubbing techniques, that would be great.