I have an Address column in a table that I need to split into multiple columns in a view in SQL Server 2005. I need to split the column on the line feed character, CHAR(10), and there could be from 1 to 4 lines (0 to 3 line feeds) in the column. Below are a couple of examples of what I need to do. What is the simplest way to make this happen?

Examples:

Address                 Address1       Address2       Address3             Address4
------------        =   ------------   ------------   ------------------   ---------
My Company              My Company     123 Main St.   Somewhere,NY 12345
123 Main St.
Somewhere,NY 12345

Address                 Address1       Address2       Address3             Address4
------------        =   ------------   ------------   ------------------   ---------
123 Main St.            123 Main St.
A: 

Parsing text in SQL is not fun. If I had to do something like this, I would export the column to a CSV text file and parse it in a scripting language such as Perl/PHP/Python. That way I can take advantage of the scripting language's built-in string functions and regular expressions.
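
As a rough sketch of that approach in Python, assuming the table has been exported to a CSV file with an `Address` column: the function names, the `Address1`..`Address4` output column names and the choice to fold any extra lines into the fourth column are all assumptions made for illustration, not part of the question.

```python
import csv

MAX_LINES = 4  # the question allows 1 to 4 address lines


def split_address(address, max_lines=MAX_LINES):
    """Split a multi-line address into exactly max_lines strings.

    Missing lines become empty strings. Lines beyond max_lines are
    folded into the last part (an assumption) rather than dropped.
    """
    # strip() drops trailing line feeds and spaces; splitlines() copes
    # with both \n and \r\n line endings.
    lines = [line.strip() for line in address.strip().splitlines()]
    if len(lines) > max_lines:
        lines = lines[:max_lines - 1] + [" ".join(lines[max_lines - 1:])]
    return lines + [""] * (max_lines - len(lines))


def split_csv(infile, outfile, column="Address"):
    """Copy a CSV from infile to outfile, appending the split columns."""
    reader = csv.DictReader(infile)
    new_cols = [f"{column}{i}" for i in range(1, MAX_LINES + 1)]
    writer = csv.DictWriter(outfile, fieldnames=list(reader.fieldnames) + new_cols)
    writer.writeheader()
    for row in reader:
        row.update(zip(new_cols, split_address(row[column])))
        writer.writerow(row)
```

For the first example in the question, `split_address("My Company\n123 Main St.\nSomewhere,NY 12345")` gives `['My Company', '123 Main St.', 'Somewhere,NY 12345', '']`. When reading and writing real files, open them with `newline=""` so the `csv` module can handle the line feeds embedded in the quoted Address field.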

Yada
+3  A: 

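This splits the address using the PARSENAME function together with REPLACE: periods in the data are swapped for a placeholder character (^), each line feed becomes a period so that PARSENAME sees the lines as name parts, and the placeholder is replaced again in the outer query.

If you have more than 4 lines this method will NOT work, because PARSENAME handles at most four parts.

Edit: added the code to reverse the order.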

    create table #test (address varchar(1000))

    --test data
    insert #test values('My Company
    123 Main St.         
    Somewhere,NY 12345')

    insert #test values('My Company2
    666 Main St.  
    Bla Bla       
    Somewhere,NY 12345')

    insert #test values('My Company2')

    --split happens here
    select
        --PARSENAME numbers parts from the right, so line 1 is part ParseLen + 1;
        --the outer REPLACE restores the periods that were swapped for '^'
        replace(parsename(address, ParseLen + 1), '^', '.') as Address1,
        replace(parsename(address, ParseLen), '^', '.') as Address2,
        replace(parsename(address, ParseLen - 1), '^', '.') as Address3,
        replace(parsename(address, ParseLen - 2), '^', '.') as Address4
    from (
        select
            --swap '.' for '^', then turn line feeds into '.' (dropping a trailing line feed first)
            case ascii(right(address, 1)) when 10 then
                replace(replace(left(address, len(address) - 1), '.', '^'), char(10), '.')
            else
                replace(replace(address, '.', '^'), char(10), '.')
            end as address,
            --ParseLen = number of line feeds, not counting a trailing one
            case ascii(right(address, 1)) when 10 then
                len(replace(replace(address, '.', '^'), char(10), '.')) -
                len(replace(replace(address, '.', '^'), char(10), '')) - 1
            else
                len(replace(replace(address, '.', '^'), char(10), '.')) -
                len(replace(replace(address, '.', '^'), char(10), ''))
            end as ParseLen
        from #test
    ) x
SQLMenace
This does a good job of parsing the pieces apart, but the parsename function fills its array in reverse order. So if you have something like 123.456.789 it returns 1=789, 2=456 and 3=123. And if you have 123.456 it returns 1=456 and 2=123. In both these scenarios, I need 1=123, 2=456 and in the first example 3=789. Not sure if that is clear. I feel like I should be able to do this using your coalesce method and going in reverse order or something, but I can't seem to get it right.
Jamie
added the code to reverse the order
SQLMenace
Ok, it is almost there. The only problem I am seeing now is that it returns NULL for all four fields if there is a line feed at the end of the source field. In other words, there is a blank last line. Is there a way we could clean up any line feeds and/or spaces at the end that might be throwing it off? Thanks for all your help SQLMenace!
Jamie
Jamie, did you try the solution I posted? It should just treat anything after a 4th line feed as more data on the 4th line (however this indicates yet another data integrity problem in your solution).
Aaron Bertrand
Aaron, I did try your solution and the output is exactly what I need. The only problem is speed. I didn't mention run time in my original post, but it needs to execute as quickly as possible. The solution above executes twice as fast as the one you posted, but your output seems to be spot on. Do you see any way to optimize your solution any further? Thanks so much for your help.
Jamie
Updated to account for ending line feed
SQLMenace
Yes, of course I see a way to optimize the solution: fix the problem! But it seems that it is out of the question, so the rest of this mess is what you get. I don't find any of these solutions very intuitive or maintainable, and I can't imagine that there is no way to fix this, even if it is creating a dummy table that just holds the different address parts, and which you update whenever the main table is updated (hopefully by restricting write access via procs, otherwise with a trigger). If you keep putting garbage into that table, you'll keep needing to deal with the garbage coming out.
Aaron Bertrand
+1  A: 

This is awfully nasty... I strongly recommend that, if you want to treat each address line separately, you store it correctly in the first place. Instead of continuing to do what you're doing, add the additional columns, fix the existing data once (instead of "fixing" it every time you run a query), and then adjust the stored procedure that does the insert / update so that it knows to use the other columns.

DECLARE @Address TABLE(id INT IDENTITY(1,1), ad VARCHAR(MAX));

INSERT @Address(ad) SELECT 'line 1
line 2
line 3
line 4'
UNION ALL SELECT 'row 1
row 2
row 3'
UNION ALL SELECT 'address 1
address 2'
UNION ALL SELECT 'only 1 entry here'
UNION ALL SELECT 'let us try 5 lines
line 2
line 3
line 4 
line 5';

SELECT
    id,
    Line1 = REPLACE(REPLACE(COALESCE(Line1, ''), CHAR(10), ''), CHAR(13), ''),
    Line2 = REPLACE(REPLACE(COALESCE(Line2, ''), CHAR(10), ''), CHAR(13), ''),
    Line3 = REPLACE(REPLACE(COALESCE(SUBSTRING(Rest, 1, COALESCE(NULLIF(CHARINDEX(CHAR(10), Rest), 0), LEN(Rest))), ''), CHAR(10), ''), CHAR(13), ''),
    Line4 = REPLACE(REPLACE(COALESCE(SUBSTRING(Rest, NULLIF(CHARINDEX(CHAR(10), Rest) + 1, 1), LEN(Rest)), ''), CHAR(10), ''), CHAR(13), '')
FROM

(
    SELECT 
        id,
        ad,
        Line1,
        Line2 = SUBSTRING(Rest, 1, COALESCE(NULLIF(CHARINDEX(CHAR(10), Rest), 0), LEN(Rest))),
        Rest = SUBSTRING(Rest, NULLIF(CHARINDEX(CHAR(10), Rest) + 1, 1), LEN(Rest))
    FROM
    (
        SELECT
            id,
            ad,
            Line1 = SUBSTRING(ad, 1, COALESCE(NULLIF(CHARINDEX(CHAR(10), ad), 0), LEN(ad))),
            Rest = SUBSTRING(ad, NULLIF(CHARINDEX(CHAR(10), ad) + 1, 1), LEN(ad))
        FROM
            @Address
    ) AS x
) AS y
ORDER BY id;

Denis' PARSENAME() trick is much tidier of course, but you have to be extremely careful to use a replacement character that cannot possibly appear in the data naturally. The caret (^) is probably a good bet, but like I said, you need to be careful.

There are also software packages out there that are really good at scrubbing address and other demographic data. But cleaning up the data entry is the most important thing here that I'll continue to stress... if each address line needs to be treated separately, then store them that way.

Aaron Bertrand
I absolutely agree 100%, but in this particular instance, I have no control over the structure of the data. It is frustrating, but it is what it is.
Jamie
Well, as long as the users are willing to wait for the view to perform this splitting every single time you run a query... then I guess you're right, it is what it is (crappy design).
Aaron Bertrand
A: 

If this is a one-time thing, you could try exporting it to CSV and using http://cleanupdata.com/.

technophile
Not a one-time thing, but thanks for the http://cleanupdata.com tip. That may prove useful at some point.
Jamie