views: 255

answers: 1

I have a varchar(4000) text column containing text such as:

'aaabbaaacbaaaccc'

and I need to collapse all runs of duplicated characters, so that only one character from each run is left:

'abacbac'

It must not be a function, procedure, CLR, or regex solution - only a true SQL SELECT.

Currently I am thinking about using a recursive WITH clause that applies REPLACE: 'aa'->'a', 'bb'->'b', 'cc'->'c'.

So the recursion should cycle until all duplicated sequences of those characters have been replaced.
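
For the example string above, a minimal sketch of that recursive-REPLACE idea (the CTE name collapse and column s are just illustrative, and it only handles the three characters listed, which is exactly its weakness):

    WITH collapse(s) AS
    (   SELECT CAST('aaabbaaacbaaaccc' AS varchar(4000))
         UNION ALL
        -- re-run the replacements until a pass changes nothing
        SELECT CAST(REPLACE(REPLACE(REPLACE(s, 'aa', 'a'), 'bb', 'b'), 'cc', 'c') AS varchar(4000))
          FROM collapse
         WHERE s <> REPLACE(REPLACE(REPLACE(s, 'aa', 'a'), 'bb', 'b'), 'cc', 'c')
    )
    SELECT TOP 1 s   -- the shortest row is the fully collapsed result
      FROM collapse
     ORDER BY LEN(s)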

Do you have another, perhaps more performant, solution?

PS: I searched this site for various REPLACE examples, but they didn't suit this case.

+3  A: 

Assuming a table definition of

CREATE TABLE myTable(rowID INT IDENTITY(1,1), dupedchars NVARCHAR(4000)) 

and sample data

    INSERT INTO myTable
    SELECT 'aaabbaaacbaaaccc'
    UNION
    SELECT 'abcdeeeeeffgghhaaabbbjdduuueueu999whwhwwwwwww'

this query meets your criteria:

    WITH Numbers(n) AS
    (   SELECT 1 AS n
         UNION ALL
        SELECT n + 1
          FROM Numbers
         WHERE n < 4000
    )
    SELECT rowid,
           (   SELECT CASE
                          -- emit a character only when it differs from the one after it;
                          -- the appended space keeps the final character of the string
                          WHEN SUBSTRING(dupedchars, n2.n, 1) = SUBSTRING(dupedchars + ' ', n2.n + 1, 1) THEN ''
                          ELSE SUBSTRING(dupedchars, n2.n, 1)
                      END AS [text()]
                 FROM myTable t2, Numbers n2
                WHERE n2.n <= LEN(dupedchars)
                  AND t.rowid = t2.rowid
                ORDER BY n2.n          -- make the concatenation order explicit
                  FOR XML PATH('')
           ) AS deduped
      FROM myTable t
    OPTION (MAXRECURSION 4000)
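
(One caveat worth knowing: FOR XML PATH('') entitizes XML-special characters, so input containing <, > or & would come out as &lt;, &gt; or &amp; and need extra handling; the sample data here is unaffected.)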

Output

rowid   deduped
   1    abacbac
   2    abcdefghabjdueueu9whwhw
CResults
CResults: it's fantastic! I was thinking along almost the same lines, but with a different approach - yours is more universal. Thanks! And what about performance for a table with 100,000+ rows? Am I right that this is the only way to do this in native SQL?
zmische
For that many rows you're looking at an execution time of around 10 seconds. The alternative (which I looked at originally) would be to replace the Numbers CTE with an indexed physical table. You *may* get some improvement from that, but the slow part of the query is the de-duping - any string manipulation of this type carries a speed overhead.
CResults
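
For reference, that indexed physical Numbers table could be set up like this (a sketch; the sys.all_objects cross join is just one common way to generate the rows, and with the table in place the WITH Numbers(n) block and the MAXRECURSION hint can be dropped from the query above):

    -- one-off setup: a persisted, indexed tally table
    CREATE TABLE Numbers (n INT NOT NULL PRIMARY KEY);

    INSERT INTO Numbers (n)
    SELECT TOP (4000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
      FROM sys.all_objects a
     CROSS JOIN sys.all_objects b;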
Note that the 10 seconds is based on string lengths similar to those above. As mentioned, the time goes on the de-duping. Fill all your fields to 4000 characters and you're looking at around 1,000 results per minute. If your fields contain duplicate values, you can optimise by supplying only the distinct values to this query.
CResults
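
That "distinct values only" optimisation might look like the following sketch, which reuses the answer's de-dupe expression but runs it once per distinct string before joining the results back to every row (table and column names come from the answer above):

    WITH Numbers(n) AS
    (   SELECT 1
         UNION ALL
        SELECT n + 1 FROM Numbers WHERE n < 4000
    ),
    DistinctStrings AS
    (   SELECT DISTINCT dupedchars FROM myTable
    ),
    Deduped AS
    (   -- the expensive character-collapse runs once per distinct string
        SELECT s.dupedchars,
               (   SELECT CASE
                              WHEN SUBSTRING(s.dupedchars, n2.n, 1) = SUBSTRING(s.dupedchars + ' ', n2.n + 1, 1) THEN ''
                              ELSE SUBSTRING(s.dupedchars, n2.n, 1)
                          END AS [text()]
                     FROM Numbers n2
                    WHERE n2.n <= LEN(s.dupedchars)
                    ORDER BY n2.n
                      FOR XML PATH('')
               ) AS deduped
          FROM DistinctStrings s
    )
    SELECT t.rowID, d.deduped
      FROM myTable t
      JOIN Deduped d ON d.dupedchars = t.dupedchars
    OPTION (MAXRECURSION 4000)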