views: 1626
answers: 5
When using the CHECKSUM column type to artificially create a hash index, is the lookup actually O(1), or is it still O(lg n) as it is for a clustered index? I have a table from which I will select rows by their ID column, and I need the lookup to be as fast as possible. Is the clustered index the fastest possible option, or is there something that will provide O(1) performance?

A: 

There's no advantage to searching an indexed CHECKSUM over a clustered index on the ID field if the ID field is an int since both will do a clustered index seek. Also, a CHECKSUM of an int column always returns the same value as the column (i.e. CHECKSUM(535) = 535). However, a CHECKSUM lookup will generally perform better if the ID is a long character column.
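As a sketch of the pattern described above (table and column names here are hypothetical, for illustration only):

```sql
-- CHECKSUM of an int is just the int itself, so hashing an int ID gains nothing:
SELECT CHECKSUM(535);   -- returns 535

-- For a long character key, the usual pattern is a computed CHECKSUM column
-- with an ordinary (b-tree) index on it:
CREATE TABLE Pages (
    Url   varchar(2000) NOT NULL,
    UrlCS AS CHECKSUM(Url)          -- computed column
);
CREATE INDEX IX_Pages_UrlCS ON Pages (UrlCS);

-- Lookups seek on the narrow checksum, then re-check the real value
-- to weed out collisions:
SELECT *
FROM Pages
WHERE UrlCS = CHECKSUM('http://example.com/')
  AND Url   = 'http://example.com/';
```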

So is there any way to achieve better performance than a clustered index? A clustered index is still O(lg n), and I was looking for O(1).
eulerfx
A: 

The lookup for a Hash index is O(1) and would be for any fairly random hash function such as a CHECKSUM. So for very large tables, it will almost always be the fastest lookup as long as it is coded efficiently and its capacity is large enough to include all the entries without excessive collisions (one hash leading to more than one valid value).

Since your CHECKSUM is already created and is a fast hash to create, it would be a good hash to use, but check the number of collisions and maybe tweak the capacity so that fewer than 20% of the values collide. Also add your most common values first to the hash table so that they will be matched immediately without having to follow whatever collision resolution technique your particular hash table implements.

If you don't have many values (i.e only hundreds or thousands), then it is possible that a clustered index will still be faster, and it is definitely more convenient since it's already built for you.

The big advantage of the index over the hash is that the values in the index are sorted for you. To access values from a hash table in sorted order, you need to create and maintain a separate sorted table. Doing so is O(log(n)) and you lose the O(1) benefits of the hash table.

lkessler
+5  A: 

Okay, two points.

First, the SQL CHECKSUM function does not produce a hash value; it actually calculates a CRC value. It is not a very good candidate to base a hash check on because there will be a relatively large number of collisions. Look at the HASHBYTES function if you want a real hash function.

Secondly, you are not actually creating a hash index. You are creating a normal b-tree on a hash value, so the lookup time will be exactly the same as for any other b-tree index on a similarly sized data type.
There is a chance you could gain a little performance by using a CRC or hash of a long varchar value to allow comparisons of a smaller number of bytes. But string comparison only checks as many bytes as it needs to (up to the first character that doesn't match), and if you do match on the hashed value, you then need to double-check the actual value anyway. So unless you have a lot of very similar strings, you will probably end up comparing MORE bytes by using the hash (or CRC).
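A sketch of that double-check step, using HASHBYTES rather than CHECKSUM (the `Pages` table, `UrlHash` column, and `@Url` parameter are hypothetical; `UrlHash` is assumed to hold a precomputed hash of `Url`):

```sql
-- HASHBYTES gives a real hash (MD5 here); you still must compare the
-- original column, because a hash match does not prove the strings match:
SELECT *
FROM Pages
WHERE UrlHash = HASHBYTES('MD5', @Url)   -- narrow the candidates via the hash
  AND Url     = @Url;                    -- then verify the actual string
```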

In short, I don't think this is a sensible plan, but as with all optimisations you should test it in your specific case and then decide. I would be interested to see your results if you would care to post them. And I don't believe that there is any faster way to locate a row in SQL server than by using a clustered index.

In case you care, Ingres (by CA) can create hash indexes, which would then achieve O(1). There may be other RDBMSs out there that also support true hash indexes.

pipTheGeek
I don't agree. The CRCs should be fairly random after you MOD some part of them by the number of buckets. I don't see why you think there would be "a relatively large number of collisions".
lkessler
For a test, I just checked for collisions on a column of 11k strings (mostly URLs, so lots of equal initial segments). With BINARY_CHECKSUM I got 3 3-way collisions and 5 2-way collisions. With HASHBYTES I got none, as you'd expect, even using MD2.
Steve Jessop
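A collision check like the one above can be reproduced with a query along these lines (table and column names are hypothetical):

```sql
-- Group by the checksum and flag any checksum value that maps to
-- more than one distinct string:
SELECT BINARY_CHECKSUM(Url) AS cs,
       COUNT(DISTINCT Url)  AS distinct_urls
FROM Pages
GROUP BY BINARY_CHECKSUM(Url)
HAVING COUNT(DISTINCT Url) > 1;
```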
+1  A: 

You can try to set things up to use a hash join; check the execution plan to verify that a hash join is actually used. When hash joins are used, SQL Server still builds the hash table first as part of executing the individual query. I believe indexes are never stored as a hash, only as trees.

In general I would not create an artificial hash column unless you're doing exact matches against potentially large strings or binary blobs (as pipTheGeek mentions). I just wanted to add that sometimes this is necessary because the strings may be too large to fit in an index key; there is a limit on index key size in SQL Server of, I think, 2k.

Of course, in your join you need to include the hash column and the source column to resolve any ambiguities that result from the hash.
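So a join on an artificial hash column would look something like this (table and column names are hypothetical):

```sql
SELECT a.*
FROM BigStringsA a
JOIN BigStringsB b
  ON a.ValueHash = b.ValueHash   -- fast, narrow comparison on the hash
 AND a.Value     = b.Value;      -- resolves any hash collisions
```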

Frank Schwieterman
+3  A: 

I don't think SQL Server natively has a hash-table-based index. The BOL documentation is talking about building a standard (tree) index on a calculated value. This is not the same thing as a Linear Hash Table, which is an index structure available on some DBMS platforms, but not SQL Server (AFAIK).

You may get some benefit from using the technique described in this blog post to hash large string values such as URL's for faster lookup. However, the underlying index is still a tree structure and is O(Log N).

ConcernedOfTunbridgeWells