views:

473

answers:

4

Say I have a million-row table [mytable] in my SQL Server 2005 database, which has columns:

  • ID
  • SomeField
  • AnotherField
  • Url

A very common query in this system is:

select * from [mytable] where url = 'http://www.somesite.com/some/really/long/url'

Which will give me better performance:

a) Add an index to the Url column.

or

b) Add an additional "url_hash" column that contains a numeric hash corresponding with the url, then compute that hash for use in my where clause, e.g.:

select * from [mytable] where url_hash = some-computed-hash and url = 'http://www.somesite.com/some/really/long/url'

Is (b) worth the extra complexity? I would need to compute the hash on insert and select.

UPDATE 03-30-2009

ID is the primary key

Also, the queries above should not have "*". Instead the select list should be all fields in the table.

"*" was just shorthand - sorry for the confusion.

UPDATE 03-31-2009

Also, forgot to mention, there would be an index on the url_hash field.

+4  A: 

If you only select the columns you require (as oppose to '*'), and you create a covering non-clustered index on 'Url' and the selected columns, you will get very efficient lookups.

Mitch Wheat
A: 

Even if you calculate a hash code for the URLs you won't get very good performance unless you add an index for the hash code column, so it's better to just add an index on the URL column instead.

Guffa
+3  A: 

Simply put, the longer a string is, and the more similar two strings are, the longer they will take to compare (consider a string 1000 characters long where the only difference is the last character, you can see how long it will take before routine discovers the discrepancy).

But, lets contrast that cost of comparing a long string to the cost of locating them on disk.

Indexes are stored in B+Trees, which are balanced trees with a variable number of the nodes, and where each node is linked to the other (a -> b -> c). This gives us two capabilities: quick look up by walking the tree, and then quick, tree order access to other nodes (once you find 'a', it's easy to find 'b', then 'c', etc.).

Indexes are laid out in disk pages, and in general the more nodes you can cram in to a index page, then the lower the overall height of the index B+tree is. The lower the tree height, the faster you will be able to find a specific row, since you will typically traverse the height of the tree (since it's balanced) to get to any one leaf node.

The lower the height, the fewer disk hits you have to make. If you have a tree that's 4 high, then to reach any random node requires loading 4 index pages in to RAM, and that's 4 disk hits. So, a 4 high tree is "twice as efficient" (for assorted values of "twice") as a 8 high tree.

Also, the more you can put in an index page, the fewer hits you'll need if you start iterating along the nodes. If your nodes hold 10 key values, loading a hundred rows will cost you 10 index page hits, whereas if it only holds 5 per node, you get twice the index disk hits.

Note, that you get a geometric progression in terms of number of records you need to add a new layer to the tree. (i.e. the difference between a 5 key node and a 10 key node is not twice the records.)

So, that's the value of have small keys -- lots of fan out in your index trees.

Mind, with a hash, you'd still have to do "where hash = and url='...'".

But it really comes down to your data access patterns, truthfully. How busy is the DB, what kind of queries you make, how much RAM you have to cache index pages, etc.

The index hit to locate your initial row may well not even be on the radar of your query times.

The key takeaway is that the number of records isn't important, but the fan out of the index tree is. For example, if you have a 1K index node, and a 4 byte index (long int), you can get 250 nodes per index (being very simplistic here), and a 3 layer tree can get, what, 16M rows all within a 3 deep tree -- any of 16M rows within 3 disk hits.

Will Hartung
So it sounds like you're saying that SO has a legitimate concern because they keyfield is long and SO should spend time investigating this strategy because it is likely to improve query times? In other words, you're encouraging this design?
le dorfier
I'm just pointing out the ramifications of his decisions. I think he should try it both ways and see if one way is significantly more performant than the other, get more data. As with everything else "it depends".
Will Hartung
A: 

Is this describing a real problem you have with a real table in a real application, or are you looking for an optimization you don't know if you need yet?

If not #1, then I suggest you index the url, and work on the rest of the application until you have a problem (which will be unlikely).

le dorfier