ansaurus

Question

Answer 1

+4 A:

This sounds like a case for Levenshtein distance.

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.

It looks like PostgreSQL has it built-in:

test=# SELECT levenshtein('GUMBO', 'GAMBOL');
 levenshtein
-------------
           2
(1 row)

http://www.postgresql.org/docs/8.3/static/fuzzystrmatch.html
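Where the fuzzystrmatch module is not available, the same metric can be computed in application code. The following is a minimal Python sketch of the textbook dynamic-programming approach, not PostgreSQL's implementation; the function name `levenshtein` is reused here only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] is the distance between the prefix of a processed so far
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning i characters of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep a match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("GUMBO", "GAMBOL"))  # 2, the same value as the query above
```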

RedFilter 2010-10-01 14:29:47

@Dave, just found that myself, thnx.

RedFilter 2010-10-01 14:36:32

This answers the OP's question (so +1 there); however, the example the OP gave for wanting this solution gives me pause. I really think that the OP would be better served by not relying on string analysis to try and figure out what the zip code really should have been. Instead, it seems a better solution would be to tie in to the USPS/MapQuest/Google Maps and validate the entire address while replacing incorrect information.

Chris Lively 2010-10-01 14:39:01

For example, consider a zip of 75084 and a city value of Richardson. Richardson's zip codes are 75080, 75081, 75082, 75083, and 75085, so every one of them is exactly 1 edit away from 75084. Which one is correct? The only way to know is to have the full address validated.

Chris Lively 2010-10-01 14:41:51

@RedFilter: This helps, but the Levenshtein distance is 1 for 33967 against both "33907" and "33965", since they both differ by 1 character.

Dave Jarvis 2010-10-01 14:42:12

@Chris: Google has a no-automated queries clause in their TOS.

Dave Jarvis 2010-10-01 14:42:43

@Dave: If you want to give preference to numbers closer together, order by Levenshtein distance and then `Zip5`, and select top 1.

RedFilter 2010-10-01 15:40:36

@Dave, sorry, that should be: order by Levenshtein distance and then by `ABS(Typo - Zip5)`.
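The two-level ordering suggested here can be sketched outside the database as follows. This is an illustration only, assuming five-digit numeric zip codes; `closest_zip` is a hypothetical helper name, and the `levenshtein` function is a plain Python stand-in for the PostgreSQL one.

```python
def levenshtein(a: str, b: str) -> int:
    # Textbook dynamic-programming edit distance (stand-in for the SQL function).
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def closest_zip(typo: str, candidates):
    # Hypothetical helper: sort key is (edit distance, numeric gap),
    # mirroring "ORDER BY levenshtein, ABS(Typo - Zip5)" and taking the top row.
    return min(candidates,
               key=lambda z: (levenshtein(typo, z), abs(int(typo) - int(z))))


# Both candidates are 1 edit from 33967; the numeric gap (60 vs. 2) breaks the tie.
print(closest_zip("33967", ["33907", "33965"]))  # 33965
```

The numeric tie-break is a heuristic: it prefers errors in the trailing digits, which may or may not match how the data was mistyped.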

RedFilter 2010-10-01 15:58:57

Answer 2

+2 A:

RedFilter answered the question that was asked, but I just wanted to clarify that the requested solution will not resolve what appears to be the real problem.

The real problem here seems to be that you have a database that was hand-keyed, and some numbers were mistyped, leaving garbage data.

The ONLY way to solve this problem is to validate the full address against a database from a provider such as the USPS or MapQuest. I know those two have APIs available for doing this.

The example I gave in a comment above was to consider a zip of 75084 and a city value of Richardson. Richardson's zip codes are 75080, 75081, 75082, 75083, and 75085, so every one of them is exactly 1 edit away from 75084. Which one is correct?

An equally serious problem: what if the entered zip code was 75083 for Richardson? That is a valid zip code for the city, so it would never be flagged, yet the address might actually be in 75082.

The only way to get that is to have the full address validated.
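The Richardson ambiguity can be checked with a short Python sketch. It assumes the five Richardson zip codes listed in this answer; `best_matches` is a hypothetical helper that returns every candidate sharing the best score, so ties stay visible.

```python
def levenshtein(a: str, b: str) -> int:
    # Textbook dynamic-programming edit distance.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def best_matches(typo, candidates, key):
    # Hypothetical helper: keep all candidates whose score equals the minimum.
    best = min(key(typo, z) for z in candidates)
    return [z for z in candidates if key(typo, z) == best]


richardson = ["75080", "75081", "75082", "75083", "75085"]

# Edit distance alone: all five candidates tie at distance 1.
by_distance = best_matches("75084", richardson, levenshtein)
print(by_distance)

# Adding a numeric-gap tie-break still leaves two candidates.
by_gap = best_matches(
    "75084", richardson,
    lambda t, z: (levenshtein(t, z), abs(int(t) - int(z))),
)
print(by_gap)
```

Neither ordering can say which zip code the address actually belongs to, which is the point of validating the full address instead.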

Chris Lively 2010-10-01 14:44:56

@Chris: You are correct. RedFilter answered my question as asked, but the question was incorrect. Your answer is the one that I really wanted to know. Thank you.

Dave Jarvis 2010-10-01 14:51:57

Match similar zip codes

Background

Sample Data

Question

Ideas

Conditions
