views:

54

answers:

2

Background

To replace invalid zip codes.

Sample Data

Consider the following data set:

 Typo  | City       | ST | Zip5
-------+------------+----+------
 33967 | Fort Myers | FL | 33902
 33967 | Fort Myers | FL | 33965
 33967 | Fort Myers | FL | 33911
 33967 | Fort Myers | FL | 33901
 33967 | Fort Myers | FL | 33907
 33967 | Fort Myers | FL | 33994
 34115 |Marco Island| FL | 34145
 34115 |Marco Island| FL | 34146
 86405 |  Kingman   | FL | 86404
 86405 |  Kingman   | FL | 86406

33967 closely matches 33965, although 33907 could also be correct. (In this case, 33967 is a valid zip code, but not in our zip code database.)

34115 closely matches is 34145 (off by one digit, with a difference of 3 for that digit).

86405 closely matches both.

Sometimes digits are simply reversed (e.g,. 89 instead of 98).

Question

How would you write a SQL statement that finds the "minimum distance" between multiple numbers that have the same number of digits, returning at most one result no matter what?

Ideas

  • Subtract the digits.
  • Use LIMIT 1.

Conditions

PostgreSQL 8.3

+4  A: 

This sounds like a case for Levenshtein distance.

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.

It looks like PostgreSQL has it built-in:

test=# SELECT levenshtein('GUMBO', 'GAMBOL');
 levenshtein
-------------
           2
(1 row)

http://www.postgresql.org/docs/8.3/static/fuzzystrmatch.html

RedFilter
@Dave, just found that myself, thnx.
RedFilter
This answers the OP's question (so +1 there); however, the example the OP gave for wanting this solution gives me pause. I really think that the OP would be better served by not relying on string analysis to try and figure out what the zip code really should have been. Instead, it seems a better solution would be to tie in to the USPS/MapQuest/Google Maps and validate the entire address while replacing incorrect information.
Chris Lively
For example, consider a zip of 75084 and a city value of Richardson. Richardson has zip codes in the range of 75080, 81, 82, 83, and 85. The minimum number of edits will be 1. However, which one? The only way to get that is to have the full address validated.
Chris Lively
@RedFilter: This helps, but the Levenshtein distance is 1 for 33967 against both "33907" and "33965", since they both differ by 1 character.
Dave Jarvis
@Chris: Google has a no-automated queries clause in their TOS.
Dave Jarvis
@Dave: If you want to give preference to numbers closer together, order by Levenshtein distance and then `Zip5`, and select top 1.
RedFilter
@Dave, sorry, order by order by Levenshtein distance and then `ABS(Typo - Zip5)`
RedFilter
+2  A: 

Redfilter answered the question that was asked, but I just wanted to clarify that the requested solution will not resolve what appears to be the real problem.

The real problem here seems to be that you have a database which was hand keyed and some numbers were transcribed giving garbage data.

The ONLY way to solve this problem is to validate the full address against a database like the USPS, MapQuest, or another provider. I know the first two have API's available for doing this.

The example I gave in a comment above was to consider a zip of 75084 and a city value of Richardson. Richardson has zip codes in the range of 75080, 81, 82, 83, and 85. The minimum number of edits will be 1. However, which one?

Another equal problem is what if the entered zip code was 75083 for Richardson. Which is a valid zipcode for that city; however, what if the address resided in 75082?

The only way to get that is to have the full address validated.

Chris Lively
@Chris: You are correct. RedFilter answered my question as asked, but the question was incorrect. Your answer is the one that I really wanted to know. Thank you.
Dave Jarvis