+3  A: 

OF COURSE a binary(16) will be MUCH faster - just do the quickest of calculations:

  • a SQL Server page is always 8K
  • if you have 16 bytes per entry, you can store 500 entries on a page
  • with 4000 bytes per entry (nvarchar) you'll end up with 2 entries per page (worst case, if your NVARCHAR(2000) are fully populated)

If you have a table with 100'000 entries, you'll need 200 pages for the index with a binary(16) key, but 50'000 pages for the same index with an nvarchar(2000) key.

Even just the added I/O to read and scan all those pages is going to kill any performance you might have had...
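(If you want to verify those page counts on your own tables, sys.dm_db_index_physical_stats shows them - a quick sketch, where 'dbo.YourTable' is just a placeholder:)

SELECT i.name AS index_name, ps.index_level, ps.page_count, ps.avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.YourTable'), NULL, NULL, 'DETAILED') ps
JOIN sys.indexes i ON i.object_id = ps.object_id AND i.index_id = ps.index_id;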

Marc

UPDATE:
For my usual indexes, I try to avoid compound indexes as much as I can - referencing them from other tables just gets rather messy (WHERE clauses with several equality comparisons).

Also, regularly check and maintain your indices - if you have more than 30% fragmentation, rebuild - if you have 5-30% fragmentation, reorganize. Check out an automatic, well tested DB Index maintenance script at http://sqlfool.com/2009/06/index-defrag-script-v30/
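(A bare-bones sketch of that 5% / 30% decision, just to show the thresholds - the linked script does the real work, including the actual REBUILD/REORGANIZE statements:)

SELECT OBJECT_NAME(ps.object_id) AS table_name, i.name AS index_name,
       ps.avg_fragmentation_in_percent,
       CASE WHEN ps.avg_fragmentation_in_percent > 30 THEN 'REBUILD'
            WHEN ps.avg_fragmentation_in_percent >= 5 THEN 'REORGANIZE'
            ELSE 'OK' END AS recommended_action
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ps
JOIN sys.indexes i ON i.object_id = ps.object_id AND i.index_id = ps.index_id
WHERE ps.index_id > 0;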

For the clustered key on a SQL Server table, try to avoid GUIDs: they're random in nature and thus cause potentially massive index fragmentation, which hurts performance. Also, while not a hard requirement, try to make sure your clustered key is unique - if it's not, SQL Server will add a four-byte uniqueifier to it. And since the clustered key gets added to each and every entry in each and every non-clustered index, it's extremely important to have a small, unique, stable (non-changing) clustered key - optimally ever-increasing, which gives you the best characteristics and performance --> INT IDENTITY is perfect.
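(For illustration, a hypothetical table with exactly that kind of clustered key:)

CREATE TABLE dbo.Recipe (
    RecipeID   INT IDENTITY(1,1) NOT NULL,  -- small, unique, stable, ever-increasing
    RecipeText NVARCHAR(2000)    NOT NULL,
    CONSTRAINT PK_Recipe PRIMARY KEY CLUSTERED (RecipeID)
);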

marc_s
What else besides pure space considerations? If several other columns are stored with the index, so your # of pages comparison isn't quite as drastic, what other differences would there be?
Rex M
+2  A: 

You can have at most 900 bytes per index entry, so your nvarchar(2000) won't fly. The biggest difference will be index depth - the number of pages to traverse from the root to the leaf page. So, if you need to search, you can index on CHECKSUM, like this:

alter table recipe add text_checksum as checksum(recipe_text)
create index text_checksum_ind on recipe(text_checksum)

(example from Indexes on Computed Columns: Speed Up Queries, Add Business Rules). The checksum will not give you an exact match, but it narrows down your search very well.
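For example, a lookup can seek on the checksum index first and then compare the full text to weed out false positives (just a sketch, reusing the column names above):

declare @search_text nvarchar(2000)
set @search_text = N'Some recipe text'

select *
from recipe
where text_checksum = checksum(@search_text) -- indexed seek narrows down the candidates
  and recipe_text = @search_text             -- exact comparison weeds out collisions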

Of course, if you need to enforce uniqueness, you'll have to use triggers.
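(A minimal sketch of such a trigger, reusing the names from the snippet above - the trigger name is made up:)

create trigger trg_recipe_unique_text on recipe
after insert, update
as
begin
  if exists (select 1
             from recipe r
             join inserted i
               on r.text_checksum = i.text_checksum -- cheap, indexed narrowing
              and r.recipe_text = i.recipe_text     -- exact comparison
             group by r.recipe_text
             having count(*) > 1)                   -- the same text exists more than once
  begin
    raiserror('Duplicate recipe_text', 16, 1)
    rollback transaction
  end
end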

Another idea is to zip your nvarchar to a smaller binary value, and index on that, but can you guarantee that every value is always zipped to 900 bytes or less?

AlexKuznetsov
+1 excellent point, yes - 900 bytes is the max for an index entry.
marc_s
You need a much bigger hash than a 32-bit checksum. CHECKSUM returns an int, and in the *best* case it will have a 50% collision probability after only ~64k records - a very, very small table. http://rusanu.com/2009/05/29/lockres-collision-probability-magic-marker-16777215/
Remus Rusanu
Remus, with a bigger hash you will have less chance of getting false positives, but you will still have some. Only triggers can truly enforce uniqueness in this case.
AlexKuznetsov
Right, if you decide to enforce it with a trigger then a fast, small hash is OK, since you'll be resolving conflicts 'manually' anyway. A large enough hash, on the other hand, lets you rely on chance alone to keep out duplicates (if a collision is reasonably unlikely, even with meet-in-the-middle), and then you can rely on index uniqueness, which is much more efficient than a trigger. It's a trade-off, of course; the right path depends on the case.
Remus Rusanu
+3  A: 

You're thinking about this from the wrong direction:

  • Do create indexes you need to meet performance goals
  • Do NOT create indexes you don't need

Whether a column is a binary(16) or an nvarchar(2000) makes little difference there, because you don't just add indexes willy-nilly.

Don't let index choice dictate your column types. If you need to index an nvarchar(2000) consider a fulltext index or adding a hash value for the column and index that.


Based on your update, I would probably create either a checksum column or a computed column using the HashBytes() function and index that. Note that a checksum isn't the same as a cryptographic hash, so you are somewhat more likely to have collisions, but you can also match the entire contents of the text and it will filter with the index first. HashBytes() is less likely to have collisions, but it is still possible, so you still need to compare the actual column. HashBytes() is also more expensive, since the hash must be computed for each query and each change.

Joel Coehoorn
Actually, that is one of the reasons I am asking this - would a short binary hash of a large field be better to index?
Rex M
A hash column can only be used to seek an exact match. If you don't need partial matches (LIKE 'foo%') nor ranges (BETWEEN 'A' AND 'B'), then you can use hashes.
Remus Rusanu
Okay: now we're looking at a different question: "I need to index an nvarchar(2000) column. The goal is to make this type of query run faster: ______. How should I do that?"
Joel Coehoorn
@Joel thanks, I've narrowed the scope of the question to that.
Rex M
+1  A: 

An index key's maximum length is 900 bytes anyway, so you cannot index an NVARCHAR(2000) directly.

A larger index key means fewer keys fit on each index page, so it creates a deeper tree, more disk used, more I/O, more buffer pool pressure, less effective caching. For clustered keys this is far worse, because the clustered key value is used as the lookup value in all non-clustered indexes, so it increases the size of all indexes.

Ultimately the single most prevalent performance driving metric in a query is the number of pages scanned/seek-ed. This translates into physical reads (=I/O wait time) or logical reads (=cache pollution).

Other than space considerations, data types make little to no difference in query behavior. char/varchar/nchar/nvarchar have collations that need to be taken into account in comparisons, but the cost of the collation order lookup is usually not a deciding factor.

And last but not least, probably the most important factor is your application access pattern. Index the columns that make queries SARGable; there is absolutely no benefit in having to maintain an index that is not used by the optimizer.
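For example (a hypothetical Orders table): the first predicate below is SARGable and can use an index seek on OrderDate, while the second applies a function to the column and forces a scan:

select * from dbo.Orders where OrderDate >= '20090101' and OrderDate < '20090201' -- seek
select * from dbo.Orders where year(OrderDate) = 2009 and month(OrderDate) = 1    -- scan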

And sometimes you have to consider concurrency issues, like when you have to eliminate deadlocks caused by distinct update access path to the same record.

Update after post edit

Use a persisted MD5 hash column:

create table foo (
    bar nvarchar(2000) not null, 
    [hash] as hashbytes('MD5', bar) persisted not null,
    constraint pk_hash unique ([hash]));
go


insert into foo (bar) values (N'Some text');
insert into foo (bar) values (N'Other text');
go

select * from foo
    where [hash] = hashbytes('MD5', N'Some text');
go

You have to be very careful with your seeks: the hash will differ wildly for any difference in input, e.g. if you seek with a varchar (ASCII) parameter instead of an nvarchar (Unicode) one...
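For example, against the table above:

-- the varchar literal hashes differently than the nvarchar one,
-- so the first seek finds nothing even though the text 'looks' identical
select * from foo where [hash] = hashbytes('MD5', 'Some text');  -- 0 rows
select * from foo where [hash] = hashbytes('MD5', N'Some text'); -- 1 row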

You'll have a decent collision chance if your table grows big.

Remus Rusanu
A: 

Actually it is better to benchmark and see for yourself. For example, the following script compares an index seek via a 4-byte integer vs. a seek via a 50-byte char. It is 3 reads for an int (the depth of the B-tree built on the INT column) vs 4 reads for a char (the depth of the B-tree built on the CHAR(50) column).

CREATE TABLE dbo.NarrowKey(n INT NOT NULL PRIMARY KEY, m INT NOT NULL)
GO
DECLARE @i INT;
SET @i = 1;
INSERT INTO dbo.NarrowKey(n,m) SELECT 1,1;
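-- double the row count on every pass, up to ~1M rows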
WHILE @i<1024000 BEGIN
  INSERT INTO dbo.NarrowKey(n,m)
    SELECT n + @i, n + @i FROM dbo.NarrowKey;
  SET @i = @i * 2;
END;
GO
IF OBJECT_ID('dbo.WideKey', 'U') IS NOT NULL DROP TABLE dbo.WideKey
GO
CREATE TABLE dbo.WideKey(n CHAR(50) NOT NULL PRIMARY KEY, m INT NOT NULL)
GO
DECLARE @i INT;
SET @i = 1;
INSERT INTO dbo.WideKey(n,m) SELECT '1',1;
WHILE @i<1024000 BEGIN
  INSERT INTO dbo.WideKey(n,m)
    SELECT CAST((m + @i) AS CHAR(50)), n + @i FROM dbo.WideKey;
  SET @i = @i * 2;
END;
GO
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO
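-- compare the 'logical reads' reported by STATISTICS IO for these two seeks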
SELECT * FROM dbo.NarrowKey WHERE n=123456
SELECT * FROM dbo.WideKey WHERE n='123456'

Index seeks are 33% slower for a wider key, but the table is 4 times bigger:

EXEC sp_spaceused 'dbo.NarrowKey';
-- 32K
EXEC sp_spaceused 'dbo.WideKey';
-- 136K
AlexKuznetsov