I'm working on a survey program where people will be given promotional considerations the first time they fill out a survey. In a lot of scenarios, the only way we can stop people from cheating the system and getting a promotion they don't deserve is to check street address strings against each other.

I was looking at using Levenshtein distance to give me a number to measure similarity, and considering pairs below a certain threshold duplicates.

However, if someone were looking to game the system, they could easily write "S 5th St" instead of "South Fifth Street", and Levenshtein would consider those strings very different. So then I was thinking to convert all strings to a 'standard address form', i.e. 'South' becomes 'S', 'Fifth' becomes '5th', etc.
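
To make the problem concrete, here's a minimal sketch of the threshold check I had in mind, using PHP's built-in levenshtein() (the threshold of 5 is an arbitrary pick):

    <?php
    // Minimal sketch: flag a pair of addresses as duplicates when their
    // edit distance falls below a threshold. The threshold of 5 is an
    // arbitrary assumption that would need tuning against real data.
    function looksLikeDuplicate(string $a, string $b, int $threshold = 5): bool
    {
        return levenshtein(strtolower($a), strtolower($b)) <= $threshold;
    }

    // The failure case: same address, but the edit distance is large,
    // so no duplicate is flagged.
    var_dump(looksLikeDuplicate('S 5th St', 'South Fifth Street')); // bool(false)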

Then I was thinking this is hopeless, and too much effort to get it working robustly. Is it?

I'm working with PHP/MySQL, so I have the limitations inherent in that system.

A: 

You can use the Google Maps API (or any other mapping API) to normalize addresses to a geo-location (lat/long).
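
As a rough sketch in PHP against the Google Geocoding web service (YOUR_API_KEY is a placeholder; error handling and rate limiting are omitted):

    <?php
    // Rough sketch: resolve an address to lat/long with the Google
    // Geocoding web service. Real code should also handle network
    // errors and the various non-OK status values.
    function geocode(string $address, string $apiKey): ?array
    {
        $url = 'https://maps.googleapis.com/maps/api/geocode/json?address='
             . urlencode($address) . '&key=' . urlencode($apiKey);
        $response = json_decode(file_get_contents($url), true);
        if (($response['status'] ?? '') !== 'OK') {
            return null;
        }
        $loc = $response['results'][0]['geometry']['location'];
        return ['lat' => $loc['lat'], 'lng' => $loc['lng']];
    }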

Franci Penov
Wouldn't work, because the majority of these geospatial APIs fail to consider apartment numbers (e.g., 12th floor, 6th room).
code4life
Yes, such normalization wouldn't be 100% accurate and you would need to do additional checks as well.
Franci Penov
A: 

I think your second idea is better than using Levenshtein distance alone. If you compare the addresses for similarity, then two different people who live near each other might accidentally "cheat" one another out of their prize. If I live at "S. 4th St." but my neighbor at "S. 5th St." already signed up, those two addresses might look too similar by Levenshtein distance.
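
For example, those two addresses are only one edit apart:

    <?php
    // The false-positive risk: one edit apart, but two different homes.
    var_dump(levenshtein('s. 4th st.', 's. 5th st.')); // int(1)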

You could reduce (but probably not eliminate) a lot of potential cheating by running addresses through a synonym normalizer. Before you check for equality, just convert

North -> N.
East -> E.
...
First -> 1st
Second -> 2nd
Third -> 3rd
...
Street -> St.
Avenue -> Ave.

The longer the list of synonyms you come up with, the better it will be at catching matches. Processing will be a little slower, but addresses are tiny.

This is similar to converting strings to all lower (or upper) case before comparing them. (Which I also recommend, naturally.)
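
Here's a sketch of what that normalizer might look like in PHP. The synonym map is only a seed; a real one would be much longer:

    <?php
    // Sketch: lowercase, strip punctuation, then replace whole words
    // using a synonym map before comparing addresses for equality.
    function normalizeAddress(string $address): string
    {
        $synonyms = [
            'north' => 'n', 'south' => 's', 'east' => 'e', 'west' => 'w',
            'first' => '1st', 'second' => '2nd', 'third' => '3rd',
            'fourth' => '4th', 'fifth' => '5th',
            'street' => 'st', 'avenue' => 'ave', 'road' => 'rd',
        ];
        $address = strtolower($address);
        $address = preg_replace('/[^a-z0-9 ]/', '', $address); // drop punctuation
        $words = preg_split('/\s+/', $address, -1, PREG_SPLIT_NO_EMPTY);
        $words = array_map(fn ($w) => $synonyms[$w] ?? $w, $words);
        return implode(' ', $words);
    }

    // Both spellings collapse to the same canonical form, "s 5th st":
    var_dump(normalizeAddress('South Fifth Street') === normalizeAddress('S. 5th St.')); // bool(true)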

Bill the Lizard
Oh, finally I understand what you're saying! I haven't used Levenshtein, so I wasn't familiar enough with it to foresee how that situation would arise :)
A: 

See these questions for related discussion.

  • Normalize your data first as much as possible:

    avenue -> ave
    road -> rd
    Rd. -> rd

    first -> 1
    1st -> 1

You could look into SOUNDEX or something similar to catch cases where words sound the same but are spelled differently (e.g. Schmitt, Schmitd, Smith). SOUNDEX works at the word level, so you'll need to split the address into words first and compare the SOUNDEX values.
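
PHP has soundex() built in, so a word-level comparison might look like this (just a sketch; numeric tokens like "5th" would need separate handling, since SOUNDEX is designed for names):

    <?php
    // Sketch: split an address into words and compare SOUNDEX codes.
    function soundexWords(string $address): array
    {
        $words = preg_split('/\s+/', strtolower($address), -1, PREG_SPLIT_NO_EMPTY);
        return array_map('soundex', $words);
    }

    // "Schmitt" and "Schmitd" share the code S253, so the sequences match.
    var_dump(soundexWords('Schmitt St') === soundexWords('Schmitd St')); // bool(true)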


You could also feed the addresses to a geolocation service such as Google Maps and store the resulting longitude and latitude in your database. When a new address is entered, you just get its longitude/latitude and compare it against the existing locations in your database. See this question for details.
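
Once the coordinates are stored, the duplicate check becomes a distance query. A sketch with PDO, assuming a hypothetical surveys table with lat and lng columns, and treating anything within 10 meters (an arbitrary threshold) as the same location:

    <?php
    // Sketch: count existing rows within ~10 meters of a new address,
    // using the haversine formula evaluated in MySQL.
    // 6371000 is the Earth's radius in meters.
    function nearbyExists(PDO $db, float $lat, float $lng): bool
    {
        $sql = 'SELECT COUNT(*) FROM surveys
                WHERE 6371000 * 2 * ASIN(SQRT(
                        POW(SIN(RADIANS(lat - :lat) / 2), 2) +
                        COS(RADIANS(:lat2)) * COS(RADIANS(lat)) *
                        POW(SIN(RADIANS(lng - :lng) / 2), 2)
                      )) < 10';
        $stmt = $db->prepare($sql);
        $stmt->execute(['lat' => $lat, 'lat2' => $lat, 'lng' => $lng]);
        return (int) $stmt->fetchColumn() > 0;
    }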

Juha Syrjälä
A: 

CASS software does the standardization you want, and on much more than just the fragments you've been considering. See http://www.semaphorecorp.com for a cheap source. See http://semaphorecorp.com/mpdd/mpdd.html for a thorough discussion on detecting duplicates.

joe snyder