tags:

views:

143

answers:

3

I have a csv file with addresses that include ',', empty space, numbers. I have two other files with city and zip codes. I want to read the address file and extract the city name and zip codes. As you will see below the addresses have no easy pattern and is randomly sprinkled with ',' spaces etc.Here is an example.

Address file
123 Riverside Drive Riverside CA 12034
Santa clara CA 93453
231 Monroe drive, OR
43 Mystic cove, O'Fallon 63045

City file
riverside
O'fallon
santa clara
Morgantown

Zip code file
02343
23454
12034
93453

Expected output file (corresponding to the input address file) in two columns

City Zipcode
Riverside 12034
Santa clara 93453
Missing Missing
O'Fallon Missing

Note that the matching shouldn't be case sensitive. I am more familiar with R but will be happy with any help in python too.
Thank you in advance.

+1  A: 

You might consider using something like this, though be careful -- there are plenty of things that can go wrong. For example,

'12034 Riverside Road' would be misinterpreted to be Riverside city, with zipcode 12034.

One way to avoid such a mistake would be to enumerate all the forms an address can take, and then use pyparsing or regex to try to match those forms.

Another problem with the code below is that it forms two possibly gigantic regexps (if the zipcode and/or city files are very large). I'm not sure how the code would perform under such a condition. We can think about how to rework the code if this proves to be a problem.

import re
import itertools as it

with open('zipcode','r') as z:
    zipcode_pat=re.compile('({0})'.format('|'.join(line.rstrip() for line in z)),
                        re.IGNORECASE)
with open('city','r') as c:
    city_pat=re.compile('({0})'.format('|'.join(line.rstrip() for line in c)),
                      re.IGNORECASE)

def gitone(seq):
    for match in seq:
        if match:
            yield match.group(1)
        else:
            yield 'Missing'

with open('address','r') as f:
    f1,f2=it.tee(f,2)
    zipcodes=gitone(zipcode_pat.search(line) for line in f1)
    cities=gitone(city_pat.search(line) for line in f2)
    for city,zipcode in it.izip(cities,zipcodes):
        print('{c} {z}'.format(c=city,z=zipcode))

# Riverside 12034
# Santa clara 93453
# Missing Missing
# O'Fallon Missing
unutbu
+2  A: 

Use Google Refine. It's an open source project that is perfect for data cleansing and export into a format that actually makes sense. Everything is GUI based and it stores a complete history. You wont need to mess around with tedious regular expressions.

Tim McNamara
A: 

I'm sure a gsubfn wizard could do this better but here are some initial steps:

Addresses <- "123 Riverside Drive Riverside CA 12034
 Santa clara CA 93453
 231 Monroe drive, OR
 43 Mystic cove, O'Fallon 63045"
cities <- tolower(c("riverside", "O'fallon", "santa clara", "Morgantown"))
addrs <- readLines(textConnection(Addresses) )

To get the lines which have ST nnnnn, which appears to be the rule you want, try:

   gsub(".*[A-Z]{2},? (\\d{5})", "\\1", addr.df)

[1] "12034" "93453"
[3] "231 Monroe drive, OR" "43 Mystic cove, O'Fallon 63045"

And mark those lines with nchar(lines) ==5 .

gsub(".*[A-Z]{2},? (\\d{5})", "\\1", addrs)[
+ grep("^\\d{5}$", gsub(".*[A-Z]{2},? (\\d{5})", "\\1", addr.df) )]

[1] "12034" "93453"

To get the line indices with valid city names:

unlist( sapply(cities, function(xpatt) grep(xpatt, tolower(addrs)) ) )

riverside o'fallon santa clara

      1           4           2 
DWin