views:

156

answers:

6

In my database, assume we have a table defined as follows:

CREATE TABLE [Chemical](
    [ChemicalId] int NOT NULL IDENTITY(1,1) PRIMARY KEY,
    [Name] nvarchar(max) NOT NULL,
    [Description] nvarchar(max) NULL
)

The value for Name can be very large, so we must use nvarchar(max). Unfortunately, we want to create an index on this column, but nvarchar(max) is not supported inside an index.

So we create the following computed column and associated index based upon it:

ALTER TABLE [Chemical]
ADD [Name_Indexable] AS LEFT([Name], 20)

CREATE INDEX [IX_Name] 
ON [Chemical]([Name_Indexable]) 
INCLUDE([Name])

The index will not be unique but we can enforce uniqueness via a trigger.

If we perform the following query, the execution plan results in a index scan, which is not what we want:

SELECT [ChemicalId], [Name], [Description] 
FROM [Chemical] 
WHERE [Name]='[1,1''-Bicyclohexyl]-2-carboxylic acid, 4'',5-dihydroxy-2'',3-dimethyl-5'',6-bis[(1-oxo-2-propen-1-yl)oxy]-, methyl ester'

However, if we modify the query to make it "sargable," then the execution plan results in an index seek, which is what we want:

SELECT [ChemicalId], [Name], [Description] 
FROM [Chemical] 
WHERE [Indexable_Name]='[1,1''-Bicyclohexyl]-' AND [Name]='[1,1''-Bicyclohexyl]-2-carboxylic acid, 4'',5-dihydroxy-2'',3-dimethyl-5'',6-bis[(1-oxo-2-propen-1-yl)oxy]-, methyl ester'

Is this a good solution if we control the format of all queries executed against the database via our middle tier? Is there a better way? Is this a major kludge? Should we be using full-text indexing?

+1  A: 

Did you try

WHERE [Name_Indexable]='1,2,3-Propanetriol'

After all that is where your index is created on

SQLMenace
+1 makes sense to me.
KM
Yes, sorry. I fixed my example to make it clearer as to what I am trying to achieve.
luksan
+2  A: 

Your index is on name_indexable, not on name. Since name_indexable is generated from a function involving name instead of directly on the column name, the optimizer won't automatically use the index when your where clause includes a reference to name. You must search on name_indexable in order to use the index. Since you have a middle tier, your best bet is probably to provide a function that searches on name_indexable if the given name is <= 200 characters, and otherwise searches on both.

Donnie
Yeah, this is what I am trying to avoid though.
luksan
You could try using a `WITH (INDEX xxx)` query hint. I'm not sure if this will actually work or not with the way you're computing the column, but it's worth trying.
Donnie
@Donnie interesting idea, still results in the same query though. Actually the index does get used in the original query, but it is performing a scan rather than a seek.
luksan
+2  A: 

Make the Name_Index column a persisted computed column and the primary key and enforce uniqueness by appending the ChemicalId instead of relying on triggers.

CREATE TABLE dbo.[Chemical]
    ([ChemicalId] int NOT NULL IDENTITY(1,1), 
    [Name] Nvarchar(max) NOT NULL, 
    [Description] Nvarchar(max) NOT NULL,
    [Name_Index] AS (CONVERT(VARCHAR(20), LEFT([Name], 20)) + CONVERT(VARCHAR(20), [ChemicalId])) PERSISTED PRIMARY KEY);
Registered User
If making it persisted allows the index to be used, good. If not, maybe you should bite the bullet and make it a "full-time" column.
Philip Kelley
That would guarantee that any value inserted into Name (even duplicates) could never violate the unique constraint, since the ID is always going to be unique.
luksan
Goal is not really to make it the primary key, we like our identity keys. I checked the query plan, your solution still does not allow us to cause an index seek without explicitly specifying [Name_Index] in the where clause.
luksan
A: 

Fix your data model. You have a comma-delimited list in the name column, to me that means you would better be able to query if you had a related table. Your name appears to be a list of ingredients not a name.

If this really truly is a real name, then Registered User has a good plan.

HLGEM
It's not a comma-delimited list, it's a chemical name. Really. I can't make this stuff up.
luksan
+1  A: 

IMHO, yes, I think this is a bad approach. If you knew that the first 20 characters will be unique, then it should be a first class column with a unique constraint. If you want to have better searching on the Name column, then using full-text search is the right way to go. If you want to ensure that the varchar(max) column is unique, then create a computed column that generates a hash off the value and place a unique constraint off that.

Alter Table Add NameHash Hashbytes('SHA1', [Name])

ADDITION

Given our discussion, if your searches are always going to be on a exact match, then you could hash your search parameter and compare it to NameHash above. However, the catch is that the match must be an exact match (i.e. case-sensitive).

I still content that the FTS will be your best bet. Even though there is overhead in breaking up your text into words, FTS is the tool best designed for doing searches against large amounts of text. The longer your search criteria, the more exact it will be, the faster the search.

Thomas
Yep, we know all about HASHBYTES. Still forces us to write a funky query to exploit the index though.
luksan
@luksan - How so? You would never use NameHash in your search. If you are searching on name, you would use full-text search and that will definitely use the (full-text) index.
Thomas
@Thomas If we enabled full-text indexing, yes. Problem is I don't know the overhead of doing that, whether it is as optimized for a scenario like this (matching the entire string as opposed to individual words), whether it is as fast as a regular index, etc. That's why I posted the question.
luksan
@luksan - Given the data you have show us and how long it is, the benefit would be enormous. FTS breaks up the data into words where "words" are determined by word breaks which are culture specific. In your case, you have obvious word breaks (the commas) in your text so FTS would work very well. In fact, I think the benefit would be enormous. The overhead is minimal and can be adjusted to scale well.
Thomas
@luksan - Keep in mind that the NameHash computed column I suggested was only to provide a means to ensure uniqueness on a varchar(max) column. It is entirely orthogonal to the need for an index to speed searches on names and for that Full Text Search would be the right solution.
Thomas
@Thomas query plan indicates full text index is only accessed if we use CONTAINS() function rather than equality operator. But we don't want to look for a substring, we want to match the entire string. Also, seems like the full-text engine would be wasting a lot of CPU time and storage space in attempting to break these silly chemical names up into "words."
luksan
@luksan - I should also mention that the dashes will likely be treated as word breaks so a search on say "dihydroxy" would be very fast with FTS.
Thomas
@luksan - If your searches are always on a unique match, then search on the hashed parameter. I.e., call Hashbytes on your input search criteria and search against NameHash.
Thomas
@Thomas tried to vote your answer up again, instead it voted it down and won't let me change my vote, ugh.
luksan
@luksan - Try it now (I edited my answer)
Thomas
@Thomas That worked.
luksan
We didn't go with the FTS but on further reflection HASHBYTES() seems less kludgy than LEFT(), so thanks!
luksan
+1  A: 

I find your solution from the question (the last query) very good, but I personally prefer to say SQL more exactly what and how I want to do. So if you works with Microsoft SQL Server or with some other SQL Server which supports CTE (common table expression) you can rewrite your query like following:

DECLARE @data nvarchar(max);

SET @data = '[1,1''-Bicyclohexyl]-2-carboxylic acid, 4'',5-dihydroxy-2'',3-dimethyl-5'',6-bis[(1-oxo-2-propen-1-yl)oxy]-, methyl ester';

WITH ReduceData ([ChemicalId], [Name], [Description]) AS (
    SELECT [ChemicalId], [Name], [Description] 
    FROM [dbo].[Chemical]
    WHERE [Name_Indexable]=LEFT(@data,20)
)
SELECT [ChemicalId], [Name], [Description] 
FROM ReduceData
WHERE [Name]=@data

(In the real implementation you probably don't need to define @data. Instead you can just use a parameterized query.). What I suggest is just to say SQL more explicit what you want. All CTE Queries can be very good optimized.

It can be that your original query will be compiled to the absolutely the same execution plan as my CTE version. You can look at the both plans and compare there. In your project you have probably much more complex queries as from your question. If you'll use more CTE, your SQL code will be easy to read, it can be very good optimized and you can be sure that SQL Server do exactly what you want.

UPDATED: By the way the line

ALTER TABLE [Chemical]
    ADD [Name_Indexable] AS LEFT([Name], 20)

should be changed to

ALTER TABLE [Chemical]
    ADD [Name_Indexable] AS CAST(LEFT([Name], 20) AS varchar(20)) PERSISTED

to make a [Name_Indexable] column of the type varchar(20) on the Microsoft SQL Server 2008 and mark it PERSISTED to stores the computed values in the table, and updates them when any other columns on which the computed column depends are updated

Oleg
I don't think you really need the CTE (just AND the two filter criteria), but the example of having a single parameter used twice is great. I totally support this and furthermore, you can wrap this in a parameterized inline table-valued UDF so that the index usage is enforced transparently. Obviously for set-based join operations, this is a bit more problematic, but common joins could be represented with views with the appropriate LEFT() usage.
Cade Roux
@Oleg Why does persisting the column matter? The column will be "persisted" in the index, which is the only place where it will really be needed (never will be used in a SELECT). Won't persisting it in the table just waste space? Won't the index nodes automatically be updated if the computed column is not persisted? What is the advantage of persisting?
luksan