I have a query (which was created by LINQ to SQL) to get me a list of 'site visits' that were made between a certain date range which resulted in an order (orderid is not null).

There's nothing wrong with the query itself; I just need advice on creating the correct index for it. I was playing around trying different combinations on a production site and managed to screw things up such that a foreign key got disconnected. I fixed that after some panic, but thought I'd ask for advice now before recreating an index.

The table is getting close to a million rows and I need the indexes to help me out here. This query is only used for reporting, so it doesn't have to be extremely fast; it just shouldn't delay other users' queries (which it is doing).

SELECT TOP 1000
  t0.SiteVisitId, t0.OrderId, t0.Date, 
  t1.Domain, t0.Referer, t0.CampaignId
FROM 
  SiteVisit AS t0
  LEFT OUTER JOIN KnownReferer AS t1 ON t1.KnownRefererId = t0.KnownRefererId
WHERE
  t0.Date <= @p0 
  AND t0.Date >= @p1
  AND t0.OrderId IS NOT NULL
ORDER BY 
  t0.Date DESC

@p0='2008-11-1 23:59:59:000', @p1='2008-10-1 00:00:00:000'

I currently have a clustered index on SiteVisitId, which is my identity integer column.

I don't know which of the following is likely to be most efficient:

  • Create an index on Date
  • Create an index on Date AND a separate index on OrderId
  • Create a 'multicolumn' index on Date AND OrderId
  • Some other combination?
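
For reference, the options above would look something like this (index names are my own, just for illustration):

```sql
-- Option 1: single-column index on Date
CREATE NONCLUSTERED INDEX IX_SiteVisit_Date ON SiteVisit ([Date]);

-- Option 2: two separate single-column indexes
CREATE NONCLUSTERED INDEX IX_SiteVisit_Date    ON SiteVisit ([Date]);
CREATE NONCLUSTERED INDEX IX_SiteVisit_OrderId ON SiteVisit (OrderId);

-- Option 3: one multicolumn (composite) index
CREATE NONCLUSTERED INDEX IX_SiteVisit_Date_OrderId ON SiteVisit ([Date], OrderId);
```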

I am also wondering whether I should create a separate bit column, hasOrder, instead of checking OrderId IS NOT NULL, in case that might be more efficient.

FYI: KnownReferer is just a table containing a list of 100 or so known HTTP referers, so I can easily see how many hits come from Google, Yahoo, etc.

+1  A: 

How many rows do you expect to have between a typical date range? Are you typically looking at a month at a time?

I would start out with an index over the [Date] column. If, for a typical query, your resulting row count is small you shouldn't need to add the [OrderId] column to your index.

On the other hand, if you have a large number of rows in a typical month, then you can add the [OrderId] column to the index, though since it's being treated as a boolean value, it may not buy you much. It depends on how many rows are NULL vs. NOT NULL. If you have a lot of rows for a given month, but only a few have a valid [OrderId], then the index would probably improve performance.

Read the accepted answer in this related question and determine if it's worth indexing over the additional column:

http://stackoverflow.com/questions/231125/should-i-index-a-bit-field-in-sql-server

And of course, test the indexes and the plans generated with and without the index.

Update: Some of the other answers specify a more aggressive index, which should improve performance of this query, but may adversely affect other operations on the table. For example, the covering index suggested will allow SQL Server to process this query with little impact to the actual table, but may cause problems when other queries write to the actual table (since SQL Server will need to update both the table and covering index in that case).

Because this is a reporting query, I would optimize it as little as possible. If this query is running long, causing other, more critical queries to run slowly or time out, I would only optimize this query enough to reduce its effect on those other queries.

Though, if you expect this table to continue growing I would consider a separate reporting schema and periodically extract data from this table.

Brannon
A: 

It's also worth considering whether you need to store rows in SiteVisit that don't have a KnownRefererId in your KnownReferer table and that have a NULL OrderId. If you don't need these, delete them from the table and change your clustered index to be on both SiteVisitId and Date, and the query should be pretty quick.

But I'm sure you're storing those extra rows for a reason.

Ben R
A: 

If you really want to optimize the bejesus out of this query, and you can accept a slightly slower insert into the table, you would create an index on:

(Date, OrderId, SiteVisitId, Domain, Referer, CampaignId)

This will let the database return an answer entirely from the index without any sorting or separate table access.
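
As DDL that would look roughly like the following. Note that Domain actually lives in KnownReferer, not SiteVisit, so it can't be part of a SiteVisit index; this sketch covers only the SiteVisit columns:

```sql
-- Covering index: the query can be answered from the index alone,
-- without sorting or a separate table access.
CREATE NONCLUSTERED INDEX IX_SiteVisit_Covering
ON SiteVisit ([Date], OrderId, SiteVisitId, Referer, CampaignId);
```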

WW
I wouldn't include that many columns in an index unless proven to be 100% necessary. If using SQL Server 2005, you would be better off INCLUDE-ing the columns that remove a bookmark lookup into the clustered index (also, Domain is in another table!)
Mitch Wheat
+1  A: 

I would create an index on Date and OrderId and INCLUDE columns SiteVisitId, Referer, CampaignId (assuming you are using SQL Server 2005 onwards). Also create an index on the foreign key column KnownRefererId.
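
A sketch of those two indexes (names are illustrative):

```sql
-- Key on the filtered/sorted columns; INCLUDE the rest so the query is
-- covered without widening the index key (SQL Server 2005+).
CREATE NONCLUSTERED INDEX IX_SiteVisit_Date_OrderId
ON SiteVisit ([Date], OrderId)
INCLUDE (SiteVisitId, Referer, CampaignId);

-- Index on the foreign key used by the join to KnownReferer.
CREATE NONCLUSTERED INDEX IX_SiteVisit_KnownRefererId
ON SiteVisit (KnownRefererId);
```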

Given that this is a reporting query and could withstand the odd uncommitted row, I would suggest using the NOLOCK (or READ UNCOMMITTED) hint:

// Requires a reference to the System.Transactions assembly
using (var trans = new TransactionScope(TransactionScopeOption.Required,
                      new TransactionOptions
                      {
                          IsolationLevel = IsolationLevel.ReadUncommitted
                      }))
{
    // Put your LINQ to SQL query here

    trans.Complete();
}

Ref.

Caveat: Only use NOLOCK hints where you have a very good reason. In the past I have seen developers come to grief through blanket use!

Mitch Wheat
+1 for the excellent index advice, -0.5 for suggesting NOLOCK; it'll just encourage using it all the time. Better to tune the indexing scheme to lower the number of records read than even try the READ UNCOMMITTED route.
SqlACID
@SqlACID: that's why I put "Caveat: Only use NOLOCK hints where you have a very good reason"!
Mitch Wheat
reports can "withstand" uncommitted rows? HUH? That's like saying Science can get along fine with a handful of untrue facts. Seriously, SQL Server people have a totally different perception of "TRUTH". What if it's rolled back? You'll have a report that includes that record, and no record anywhere.
A: 
SELECT TOP 1000
  t0.SiteVisitId, t0.OrderId, t0.Date, 
  t1.Domain, t0.Referer, t0.CampaignId
FROM 
  SiteVisit AS t0
LEFT OUTER JOIN KnownReferer AS t1 ON t1.KnownRefererId = t0.KnownRefererId
WHERE
  t0.Date <= @p0 
  AND t0.Date >= @p1
  AND t0.OrderId IS NOT NULL
ORDER BY 
  t0.Date DESC

@p0='2008-11-1 23:59:59:000', @p1='2008-10-1 00:00:00:000'

I'm going to guess on table stats here, and the resulting design could slow down other queries - but that's generally the trade-off. I usually find that when moving a clustered index, it's best to create a replacement index to avoid upsetting other queries too much.

On the assumption that there are a lot of rows in the one-month date range, and relatively few of those have OrderId IS NULL - you'd be best off with a clustered index on Date. That should give you a clustered index scan, with results nicely ordered for your TOP 1000.
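
Moving the clustered index would look roughly like this, keeping a nonclustered index on SiteVisitId as the replacement (the constraint name PK_SiteVisit is a guess - check yours with sp_help):

```sql
-- Assuming SiteVisitId is currently the clustered primary key:
ALTER TABLE SiteVisit DROP CONSTRAINT PK_SiteVisit;

-- Keep SiteVisitId lookups fast via a nonclustered primary key.
ALTER TABLE SiteVisit ADD CONSTRAINT PK_SiteVisit
    PRIMARY KEY NONCLUSTERED (SiteVisitId);

-- Cluster on Date so the range scan comes back pre-sorted.
CREATE CLUSTERED INDEX CIX_SiteVisit_Date ON SiteVisit ([Date]);
```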

You may also want KnownReferer.KnownRefererId to be either the clustered index, or in a combined index of KnownRefererId + Domain, to avoid a lookup into that table. I'd guess your number of KnownReferers is small though - so I wouldn't expect much benefit from this.
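
The combined-index variant would be along these lines (name is illustrative):

```sql
-- Covers the join: KnownRefererId to seek, Domain to avoid a lookup.
CREATE NONCLUSTERED INDEX IX_KnownReferer_Id_Domain
ON KnownReferer (KnownRefererId, Domain);
```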

Mark Brackett