views:

372

answers:

3

I have a data set that I import into a SQL table every night. One field is 'Address_3' and contains the City, State, Zip and Country fields. However, this data isn't standardized. How can I best parse the data that is currently going into 1 field into individual fields. Here are some examples of the data I might receive:

'INDIANAPOLIS, IN 46268  US'
'INDIANAPOLIS, IN 46268-1234  US'
'INDIANAPOLIS, IN 46268-1234'
'INDIANAPOLIS, IN 46268'

Thanks in advance! David

+2  A: 

I've done something similar (not in T-SQL) and I find it works best to start at the end of the string and work backwards.

  • Grab the rightmost element up to the first space or comma.
  • Is it a known country code? It's a country
  • If not, is it all numeric (including a hyphen)? It's a zip code.
  • Else discard it

Grab the second rightmost element up to the next space or comma

  • Is it a two alpha-character field? It's the state

Grab everything else preceding the last comma and call it the city.

You'll need to make some adjustments based on what your input data looks like but the basic idea is to start from the right, grab the elements you can easily classify and call everything else the city.

You can implement something like this by using the REVERSE function to make searching easier (in which case you'll be parsing the string from left to right instead of right to left like I said above), the PATINDEX or CHARINDEX functions to find spaces and commas, and the SUBSTRING function to pull the address apart based on the positions found by PATINDEX and CHARINDEX. You could use the ASCII function to determine if a character is numeric or not.

You tagged your question with the SSIS tag as well - it might be easier to implement the parsing in some VB script in SSIS rather than try to do it with T-SQL.

TLiebe
A: 

In SSIS you can have 4 derived columns (city,state,zip,country).

substring(column,1,FINDSTRING(",",column,1)-1) --city substring(column,FINDSTRING(" ",column,1)+1,FINDSTRING("",column,2)-1) --state substring(column,FINDSTRING(" ",column,2)+1,FINDSTRING(" ",column,3)-1) -- zip

You can see the pattern above and continue accordingly. This might get a bit complicated. You can use a Script Component to better pull out the lines of text.

Josef Richberg
+1  A: 

By far the best way is to not reinvent the wheel and get an address parsing and standardization engine. Ideally, you would use a CASS certified engine which is what is approved by the Postal Service. However, there are free address parsers on the net these days and any of those would be more accurate and less frustrating than trying to parse the address yourself.

That said, I will say that address parsers and the Post Office work from bottom up (So, country, then zip code, then city, then state then address line 2 etc.).

Thomas