I'm working on a survey program where people will be given promotional considerations the first time they fill out a survey. In a lot of scenarios, the only way we can stop people from cheating the system and getting a promotion they don't deserve is to check street address strings against each other.
I was looking at using Levenshtein distance to get a number that measures similarity, and treating any pair of addresses whose distance falls below a certain threshold as duplicates.
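For reference, this is roughly the kind of check I had in mind. It is only a sketch: the function name `isLikelyDuplicate` and the threshold of 3 are placeholders I made up, not tuned values. It relies on PHP's built-in `levenshtein()`.

```php
<?php
// Hypothetical helper: the name and the default threshold are placeholders.
function isLikelyDuplicate(string $a, string $b, int $threshold = 3): bool
{
    // Lowercase and trim so case and stray whitespace don't count as edits.
    $a = strtolower(trim($a));
    $b = strtolower(trim($b));

    // levenshtein() returns the number of single-character inserts,
    // deletes and substitutions needed to turn $a into $b.
    return levenshtein($a, $b) <= $threshold;
}
```

With this, "123 Main Street" and "123 Main Stret" would count as duplicates, since they are one edit apart.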
However, if someone were looking to game the system, they could easily write "S 5th St" instead of "South Fifth Street", and Levenshtein would consider those strings to be very different. So then I was thinking of converting all strings to a 'standard address form' first, e.g. 'South' becomes 's', 'Fifth' becomes '5th', etc.
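Something like the following is what I mean by a 'standard address form'. Again, just a sketch: `normalizeAddress` and the replacement table are my own invention and nowhere near complete, and it ignores ambiguous cases (for example 'St' meaning 'Saint').

```php
<?php
// Hypothetical normalizer: the mapping table is illustrative and incomplete.
function normalizeAddress(string $address): string
{
    $map = [
        'north' => 'n', 'south' => 's', 'east' => 'e', 'west' => 'w',
        'street' => 'st', 'avenue' => 'ave', 'road' => 'rd',
        'first' => '1st', 'second' => '2nd', 'third' => '3rd',
        'fourth' => '4th', 'fifth' => '5th',
    ];

    // Lowercase, then turn punctuation into spaces so "St." matches "st".
    $address = strtolower($address);
    $address = preg_replace('/[^a-z0-9 ]+/', ' ', $address);

    // Split on whitespace and replace each word that has a standard form.
    $tokens = preg_split('/\s+/', trim($address), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $i => $token) {
        if (isset($map[$token])) {
            $tokens[$i] = $map[$token];
        }
    }

    return implode(' ', $tokens);
}

// Both spellings reduce to the same string: "s 5th st"
var_dump(normalizeAddress('South Fifth Street') === normalizeAddress('S 5th St.'));
```

The normalized strings could then be compared exactly, or fed into the Levenshtein comparison to catch leftover typos.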
Then I was thinking this is hopeless, and too much effort to get it working robustly. Is it?
I'm working with PHP/MySQL, so I have the limitations inherent in that stack.