views: 1027 · answers: 6
I have a very high-traffic table with a char(50) field which takes part in several indexes. This char(50) field allows NULLS, and in that case a NULL value is considered to be the same as a non-NULL, zero-length string for my purposes.

I also disregard leading & trailing whitespace and, while I scrub the data before I insert it, it may also be inserted by means beyond my control.

I have a sproc that is used to copy data from one table to the main table, and it needs to be high-performance. I need to delete duplicate records before inserting the new data and I am using the method discussed in this thread to perform the deletes.

My delete statement looks like this (simplified):

delete t
from masterTable t
    join incomingDataTable inc on
    (
     LTRIM(RTRIM(COALESCE(inc.TextField,''))) = 
             LTRIM(RTRIM(COALESCE(t.TextField,'')))
    )
where LTRIM(RTRIM(COALESCE(t.TextField,''))) <> ''

I have read that constructs like LTRIM(RTRIM(...)) are bad. Can my delete statement be improved, and if so, how?

EDIT: Just to clarify, TextField does take part in indexes on both tables.

EDIT2: TextField is defined as char(50) in both tables. It is not of type TEXT.

+2  A: 

I think in SQL Server a padded string compares equal to a non-padded string (trailing spaces are ignored in comparisons), which would save you the RTRIM at least. I'm not 100% sure on that, though...

However, tidying data is all part of ETL and should be done before the data gets to where it is going. On large datasets you may find it quicker to create a temporary copy of the data, clean it, index it, and then do the required matching.
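A sketch of that staging approach, under the assumption that the temp-table and column names here are illustrative (only TextField appears in the question):

    -- Copy the incoming data into a temp table, cleaning the text once,
    -- then index the cleaned column so the subsequent join can seek on it.
    SELECT LTRIM(RTRIM(COALESCE(TextField, ''))) AS TextFieldClean
           -- , other columns the insert needs would be listed here
    INTO   #incomingStage
    FROM   incomingDataTable;

    CREATE INDEX IX_incomingStage_TextFieldClean
        ON #incomingStage (TextFieldClean);

The trimming cost is then paid once per batch instead of once per join comparison.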

ck
+4  A: 

It is bad because your JOIN will have to scan the whole index: wrapping the column in functions makes the condition non-sargable, so no index seek is possible.

Also, are you sure it is a TEXT datatype? Last I checked, you could not use LTRIM or RTRIM against a TEXT column.

In response to the char-versus-varchar comment, run this:

declare @v varchar(50), @v2 char(50)
select @v = 'a', @v2 = 'a'

select datalength(@v), datalength(@v2)
SQLMenace
Any recommendations on how to improve it?
John Dibling
Regarding your question, it is specifically char(50). Sorry I was imprecise, and I will edit my OP.
John Dibling
If you have leading spaces then you probably can't improve it. You should have a constraint on the table that disallows leading spaces; that way you don't have to worry about this later.
SQLMenace
You should have used varchar(50), not char(50), unless your values are always exactly 50 chars — but since you are doing RTRIM, I doubt it.
SQLMenace
Why varchar vs char? Does it take less space, or some other reason?
John Dibling
I would say clean the data first, add the constraint and then you can do a regular join
SQLMenace
Run the additional code I posted and you will see that char uses all 50 bytes.
SQLMenace
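A sketch of the constraint SQLMenace describes (the constraint name is assumed). Note that for a char(50) column trailing padding is ignored in comparisons, so in practice this enforces "no leading spaces":

    ALTER TABLE masterTable WITH CHECK
    ADD CONSTRAINT CK_masterTable_TextField_NoLeadingSpace
    CHECK (TextField IS NULL OR TextField NOT LIKE ' %');

WITH CHECK makes SQL Server validate existing rows too, so the data would need to be cleaned before the constraint can be added.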
+2  A: 

I believe that SQLMenace is correct.

How about adding an INSERT/UPDATE trigger to the table to guarantee that there is no whitespace on that column?

If the column is VARCHAR, SQL Server will automatically ignore trailing whitespace in comparisons. Leading whitespace still counts, though.

Actually, wouldn't SQL Server automatically pad both columns to CHAR(50) before doing the JOIN? (Implicit conversions.)
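A sketch of such a trigger, assuming masterTable has a key column named ID (the question doesn't show one):

    CREATE TRIGGER trg_masterTable_TrimTextField
    ON masterTable
    AFTER INSERT, UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Re-trim only rows that actually arrived with leading whitespace;
        -- trailing spaces on a char(50) are padding and compare equal anyway.
        UPDATE t
        SET    TextField = LTRIM(TextField)
        FROM   masterTable t
        JOIN   inserted i ON i.ID = t.ID   -- ID is a hypothetical key column
        WHERE  t.TextField LIKE ' %';
    END;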

beach
+6  A: 

You need to:

  1. Create a computed column on masterTable using expression LTRIM(RTRIM(COALESCE(TextField,'')))
  2. Build an index on this column and
  3. Use this column in a join.

The way your table is designed now, it's all but impossible to make this query index-friendly.

If you cannot change your table structure but can estimate the number of LEADING spaces, you may use an approach described here.

This solution, however, is not nearly as efficient as creating an index on a computed column.
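A sketch of steps 1–3 (the column and index names are assumed). LTRIM, RTRIM, and COALESCE are deterministic, so the computed column is indexable; PERSISTED additionally stores the value in the row:

    ALTER TABLE masterTable
    ADD TextFieldClean AS LTRIM(RTRIM(COALESCE(TextField, ''))) PERSISTED;

    CREATE INDEX IX_masterTable_TextFieldClean
        ON masterTable (TextFieldClean);

    -- The delete can then seek on the indexed computed column; the
    -- functions remain only on the (presumably smaller) incoming side:
    DELETE t
    FROM   masterTable t
    JOIN   incomingDataTable inc
           ON t.TextFieldClean = LTRIM(RTRIM(COALESCE(inc.TextField, '')))
    WHERE  t.TextFieldClean <> '';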

Quassnoi
or clean the data and add a constraint which will disallow trailing and leading spaces
SQLMenace
Yes, though it may be a pain on an running database.
Quassnoi
+3  A: 

I would recommend changing that datatype to VARCHAR(50). Up to about 10 characters, CHAR(x) can make sense since it's a tad faster and has less overhead, but at 50 characters, unless all values actually use the full 50, it is a major overhead, especially since this column is also used in indexes.

Changing it to VARCHAR(50) could significantly reduce the space needed for the table (depending on your amount of data and how much of the 50 chars is really used), and all the indexes involved would also get a lot smaller — plus you wouldn't need this COALESCE/LTRIM/RTRIM stuff anymore :-)
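A sketch of that conversion (it assumes the indexes containing TextField are dropped beforehand and recreated afterwards, since ALTER COLUMN will refuse to run while they exist):

    -- 1. Drop indexes that include TextField (not shown).
    -- 2. Change the type; existing values keep their CHAR padding...
    ALTER TABLE masterTable
    ALTER COLUMN TextField varchar(50) NULL;

    -- 3. ...so strip the now-stored trailing spaces once:
    UPDATE masterTable
    SET    TextField = RTRIM(TextField)
    WHERE  TextField LIKE '% ';

    -- 4. Recreate the indexes (not shown).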

Marc

marc_s
+2  A: 

If you have to trim the data every time you use it, this should NOT be a char datatype but a varchar datatype. Any time you have to apply a function every time you query a field, something is wrong with your database design.

You might find this discussion helpful: http://stackoverflow.com/questions/758699/is-the-char-datatype-in-sql-obsolete-when-do-you-use-it/760511#760511

HLGEM
You'll get no argument from me that there is something wrong with the db's design!
John Dibling