tags:
views: 77
answers: 2

Following on from a previous question, I'm trying to clean up some data where IDs are stored as a comma-separated list of values. I need to have these broken out into separate rows. I have a query that works, but it is rather slow. Do you have any ideas that would be faster than what I'm doing?

SET NOCOUNT OFF
DECLARE @Conversion TABLE
(
    ID bigint
    , LogSearch_ID int
    , LogSearchDimension_ID int
    , SearchValue varchar(MAX)
)
DECLARE @RowsUpdated int, @MaxRows int, @NumUpdates int;
SET @MaxRows = 500;
SET @NumUpdates = 0;
SET @RowsUpdated = 1;
WHILE @RowsUpdated > 0 AND @NumUpdates < @MaxRows
BEGIN
    INSERT INTO @Conversion (ID, LogSearch_ID, LogSearchDimension_ID, SearchValue )
    SELECT TOP 1
     ID, LogSearch_ID, LogSearchDimension_ID, SearchValue
     FROM LogSearchesDimensions (NOLOCK)
     WHERE LogSearchDimension_ID = 5 AND SearchValue LIKE '%,%';

    INSERT INTO LogSearchesDimensions (LogSearch_ID, LogSearchDimension_ID, SearchValue)
    SELECT 
     c.LogSearch_ID
     , c.LogSearchDimension_ID 
     , s.val
    FROM 
     @Conversion c
     -- The split function returns a table with each item as a row in column 'val'
     CROSS APPLY dbo.Split(c.SearchValue, 0, 0) AS s;

    SET @RowsUpdated = @@rowcount;
    SET @NumUpdates = @NumUpdates + 1;
    DELETE FROM LogSearchesDimensions WHERE ID = (SELECT ID FROM @Conversion)
    DELETE FROM @Conversion;

END

The split function looks like this (I didn't write it myself):

CREATE FUNCTION SPLIT
(
  @s nvarchar(max),
  @trimPieces bit,
  @returnEmptyStrings bit
)
returns @t table (val nvarchar(max))
as
begin

declare @i int, @j int
select @i = 0, @j = (len(@s) - len(replace(@s,',','')))

;with cte
as
(
  select
    i = @i + 1,
    s = @s,
    n = substring(@s, 0, charindex(',', @s)),
    m = substring(@s, charindex(',', @s)+1, len(@s) - charindex(',', @s))

  union all

  select
    i = cte.i + 1,
    s = cte.m,
    n = substring(cte.m, 0, charindex(',', cte.m)),
    m = substring(
      cte.m,
      charindex(',', cte.m) + 1,
      len(cte.m)-charindex(',', cte.m)
    )
  from cte
  where i <= @j
)
insert into @t (val)
select pieces
from
(
  select
  case
    when @trimPieces = 1
    then ltrim(rtrim(case when i <= @j then n else m end))
    else case when i <= @j then n else m end
  end as pieces
  from cte
) t
where
  (@returnEmptyStrings = 0 and len(pieces) > 0)
  or (@returnEmptyStrings = 1)
option (maxrecursion 0)

return

end

GO
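
For reference, calling the function against a made-up sample string returns one row per piece in the val column:

-- Illustration only; the sample values are invented
SELECT val FROM dbo.Split('101,102,103', 0, 0);
-- Returns three rows: 101, 102, 103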

So what the query is doing is grabbing a single row that has a comma-separated value in it, breaking it out into multiple rows, inserting those back into the dimensions table, and then deleting the original row. It's taking forever to go through and run the updates. Do you have any suggestions for improvement?
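
To make that concrete, here's a hypothetical before/after for a single dimension row (the IDs and values are invented for illustration):

-- Before: one row holding the comma-separated list
--   ID = 42, LogSearch_ID = 7, LogSearchDimension_ID = 5, SearchValue = '101,102,103'
-- After: three rows, one per value, and the original row (ID = 42) deleted
--   LogSearch_ID = 7, LogSearchDimension_ID = 5, SearchValue = '101'
--   LogSearch_ID = 7, LogSearchDimension_ID = 5, SearchValue = '102'
--   LogSearch_ID = 7, LogSearchDimension_ID = 5, SearchValue = '103'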


Here's the final solution I settled on. Not terribly fast, but stable and faster than doing all of the looping to split strings.

SET NOCOUNT ON
DECLARE @RowsUpdated int, @MaxRows int, @NumUpdates int, @SQL varchar(max);
SET @MaxRows = 100;
SET @NumUpdates = 0;
SET @RowsUpdated = 1;
WHILE @RowsUpdated > 0 AND @NumUpdates < @MaxRows
BEGIN
    BEGIN TRANSACTION
     SET @SQL = (
     SELECT TOP 1
      'INSERT INTO LogSearchesDimensions (SearchValue, LogSearch_ID, LogSearchDimension_ID) SELECT ' 
      + REPLACE(SearchValue, ',', ', ' + Cast(LogSearch_ID AS varchar) + ', ' + CAST(LogSearchDimension_ID AS varchar) + ' UNION ALL SELECT ') 
      + ', ' + Cast(LogSearch_ID AS varchar) + ', ' + CAST(LogSearchDimension_ID AS varchar) + ';'
      + 'DELETE FROM LogSearchesDimensions WHERE ID = ' + CAST(ID AS varchar) + ';' AS SQL
      FROM LogSearchesDimensions (NOLOCK)
      WHERE LogSearchDimension_ID = 5 AND SearchValue LIKE '%,%'
     )
     IF @SQL IS NULL -- no rows with comma-separated values left to convert
     BEGIN
      COMMIT
      BREAK
     END
     EXEC(@SQL); -- run the generated INSERT ... SELECT / DELETE batch
     SET @RowsUpdated = @@rowcount;
     SET @NumUpdates = @NumUpdates + 1;

    COMMIT
END
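
For a hypothetical row with ID = 42, LogSearch_ID = 7, LogSearchDimension_ID = 5, and SearchValue = '101,102,103', the @SQL batch this builds comes out like the following (line breaks added for readability):

INSERT INTO LogSearchesDimensions (SearchValue, LogSearch_ID, LogSearchDimension_ID)
SELECT 101, 7, 5 UNION ALL SELECT 102, 7, 5 UNION ALL SELECT 103, 7, 5;
DELETE FROM LogSearchesDimensions WHERE ID = 42;
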
A: 

Doing the splitting in SQL will be slow. Have you considered exporting the data to a flat file and using an SSIS package to re-import it?

Dave Swersky
I haven't, strictly because I don't have any experience with SSIS :/
Daniel Short
A: 

Instead of a split inside your cursor through the table, try something like this:

DECLARE @sql varchar(MAX);
SELECT @sql = 'insert into mytable(id, otherfield1, otherfield2) select '
  + REPLACE(@idfield, ',', ', ' + @otherfield1 + ', ' + @otherfield2 + ' union all select ')
  + ', ' + @otherfield1 + ', ' + @otherfield2; -- append the other fields for the last value in the list
EXEC(@sql);

Then, after the cursor finishes working through the rows that have comma-separated values, run a simple delete statement to remove the original rows.

This assumes otherfield1 and otherfield2 are numeric, otherwise you'll need to do some escaping in that dynamic SQL.
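
To illustrate that escaping (the variable and value here are hypothetical), you'd wrap each value in quotes and double any embedded single quotes before splicing it into the generated SQL:

DECLARE @otherfield1 varchar(50);
SET @otherfield1 = 'O''Brien';
-- Wrap in quotes and double embedded quotes so the value survives inside the dynamic SQL
SELECT '''' + REPLACE(@otherfield1, '''', '''''') + ''''; -- produces the literal 'O''Brien'
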

richardtallent
As "dirty" as this makes me feel, it does run quicker. I've implemented this strategy and it's handling about 1 record per second. Still a lot slower than I'd like, but it's better than the 1 record per 3 or 4 seconds I was getting prior. Thanks.
Daniel Short
If you want to feel slightly less dirty, you could do something similar by letting `sp_xml_preparedocument` parse the CSV into an XML document variable and then operate on that, or maybe create a CLR function to do the work, but I don't think either would improve performance.
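
For what it's worth, a rough sketch of that sp_xml_preparedocument idea might look something like this (the sample CSV and variable names are invented):

DECLARE @csv varchar(max), @xml varchar(max), @hdoc int;
SET @csv = '101,102,103';
-- Turn the CSV into a small attribute-centric XML document
SET @xml = '<r><v val="' + REPLACE(@csv, ',', '"/><v val="') + '"/></r>';
EXEC sp_xml_preparedocument @hdoc OUTPUT, @xml;
-- One row comes back per original CSV element
SELECT val FROM OPENXML(@hdoc, '/r/v', 1) WITH (val varchar(100) '@val');
EXEC sp_xml_removedocument @hdoc;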
richardtallent