I have a table which may have anywhere from 10,000 to 10 million rows over the life of the application. This table contains NACSZ info (name, address, city, state, ZIP), among other things, and I need to write a query that checks for a matching NACSZ like so:

select 
    * 
from 
    Profiles 
where 
    FirstName = 'chris' and
    LastName = 'test' and
    Address1 = '123 main st' and
    City = 'chicago' and
    State = 'il' and
    Zip = '11111'

I'm trying to optimize the table for this query, which will be run inside an "if exists()" construct, but I'm not having much luck. With no index, and ~110,000 rows of test data, I get:
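
For reference, the check is used roughly like this to prevent duplicate inserts (simplified):

if not exists
(
    select 1
    from Profiles
    where FirstName = 'chris' and
        LastName = 'test' and
        Address1 = '123 main st' and
        City = 'chicago' and
        State = 'il' and
        Zip = '11111'
)
begin
    insert into Profiles (FirstName, LastName, Address1, City, State, Zip)
    values ('chris', 'test', '123 main st', 'chicago', 'il', '11111')
end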

Table 'Profiles'. Scan count 1, logical reads 2021, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

SQL Server Execution Times: CPU time = 16 ms, elapsed time = 70 ms.

SQL Server Execution Times: CPU time = 0 ms, elapsed time = 1 ms.

The execution plan indicates a clustered index scan on the primary key. I thought about adding an index on all of the values, but that would create a rather large key, and I'd like to avoid that if possible. My next thought was to index something that would be reasonably different between rows, so I tried indexing on first name (though I could have used address line 1 or zip, for example) and created the index:

create index ix_profiles_firstName on profiles(FirstName)

Now, when I run the same query, I get:

Table 'Profiles'. Scan count 1, logical reads 171, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

SQL Server Execution Times: CPU time = 0 ms, elapsed time = 52 ms.

SQL Server Execution Times: CPU time = 0 ms, elapsed time = 1 ms.

Obviously, indexing first name made a HUGE difference. My question is: how do I decide whether to index first name vs. last name vs. address vs. zip code? Is there a command I can run over my sample data that will tell me the uniqueness of the values in each field? My understanding is that I should index the column with the most uniqueness to make the index work best; is that correct?

+3  A: 

For your query, you should create a composite index on all columns: (FirstName, LastName, Address1, City, State, Zip)
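
For example (the index name here is just for illustration):

CREATE INDEX IX_Profiles_NACSZ ON Profiles (FirstName, LastName, Address1, City, State, Zip)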

If you want to force SQL Server to use a certain index, issue:

SELECT  *
FROM    Profiles WITH (INDEX (index_name))
WHERE 
        FirstName = 'chris' and
        LastName = 'test' and
        Address1 = '123 main st' and
        City = 'chicago' and
        State = 'il' and
        Zip = '11111'

My question is, how do I decide whether I should index first name vs last name vs address vs. zip code?

Index all of the columns you are filtering on.

Note that you can efficiently filter on any leading prefix of the index's columns. For example:

SELECT  *
FROM    Profiles
WHERE   FirstName = 'chris'

will use the index to search on FirstName,

SELECT  *
FROM    Profiles
WHERE   FirstName = 'chris'
        AND LastName = 'test'

will use the index to search on both FirstName and LastName,

SELECT  *
FROM    Profiles
WHERE   FirstName = 'chris'
        AND City = 'chicago'

will use the index to search only on FirstName (since you don't filter on LastName, there is a gap in the key, and the index cannot be used to search on the columns after it).

Is there a command I can run over my sample data that will tell me uniqueness of values in each field?

SELECT   COUNT(DISTINCT FirstName) * 1.0 / COUNT(*) -- * 1.0 avoids integer division, which would always return 0
FROM     Profiles

will show you the selectivity of FirstName (the ratio of distinct values to total rows).

The closer this value is to 1, the more selective the column and the more efficient the index.
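
To compare the candidate columns side by side, you can compute the same ratio for each of them in one pass (a sketch using your column names):

SELECT  COUNT(DISTINCT FirstName) * 1.0 / COUNT(*) AS FirstNameSelectivity,
        COUNT(DISTINCT LastName) * 1.0 / COUNT(*) AS LastNameSelectivity,
        COUNT(DISTINCT Address1) * 1.0 / COUNT(*) AS Address1Selectivity,
        COUNT(DISTINCT Zip) * 1.0 / COUNT(*) AS ZipSelectivity
FROM    Profiles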

My understanding is that I should try to index the column with the most uniqueness to make the index work best, correct?

Yes.

Again, in your case you should index all of the columns together: the most uniqueness is, of course, in all of the columns taken together.

Quassnoi
I tried the composite index before and wasn't getting good results. For whatever reason, I just tried it again and got even better results. I use JMeter to submit randomized data to a web page that calls this procedure, and was able to process 1500+ lookups/inserts per second. Awesome.
Chris
With the composite index: Table 'Profiles'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Chris
+1  A: 

My question is, how do I decide whether I should index first name vs last name vs address vs. zip code?

Gather all the queries you intend to use (if this is the only one, you're done). Then hand the queries over as a workload to the Index Tuning Wizard (the Database Engine Tuning Advisor in SQL Server 2005 and later) and look at the recommendations.
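
For example, you can save the query to a .sql file and feed it to the command-line version, dta (the server, database, and session names below are placeholders; double-check the flags against your version's documentation):

dta -E -S YourServer -D YourDatabase -if workload.sql -s ProfilesTuningSession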

My understanding is that I should try to index the column with the most uniqueness to make the index work best, correct?

The more unique an index is, the fewer rows have to be looked up in the actual table. The narrower the index is, the faster it can be read. (This second rule is why a composite index on all of the criteria columns is no good.)

David B
Regarding your last statement, that's what I was taught. In this case, though, the performance hit from using a composite index doesn't seem to be present or relevant. I'm going to load a few million test rows over the weekend and see if that changes.
Chris
A: 

In addition to other answers...

What combination of filters will you be running? Try to cover the most popular combinations.

gbn
The query will always try to match all columns; basically, I'm trying to prevent duplicate registrations.
Chris
Ta. Quassnoi's answer is best.
gbn
+1  A: 

If this query is as critical as it seems, then I suggest you create a derived column that concatenates the fields, and then build that same key explicitly in your query. Sure, it's redundant, but it may end up being simpler than dealing with the fields piecemeal if you don't have to.

le dorfier
+1  A: 

You have a couple of options. As Quassnoi indicated, you could create a composite index. Another option I used in a slightly different scenario was to generate a unique key based on the data. In my situation I was comparing addresses and trying to prevent duplicates (since we would geocode any new address, and each geocoding cost $$).

Anyway, essentially we took the key pieces of the address and created a new key (Address, State & Zip). You could do the same thing, and then compare against only one column.

One gotcha is making sure you keep this column synchronized when the record changes. You could look at using a computed column and indexing that, which would handle the synchronization for you.
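
A minimal sketch of that idea, using the column names from the question (the column name, index name, and '|' separator are mine):

alter table Profiles add NacszKey as
    (FirstName + '|' + LastName + '|' + Address1 + '|' + City + '|' + State + '|' + Zip)
go
create index ix_profiles_nacszKey on Profiles(NacszKey)

Then the duplicate check only has to compare a single indexed column. One caveat: if any of the columns is nullable, the '+' concatenation makes the whole key NULL, so wrap nullable columns in isnull().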

JoshBerke
I have used the method you describe in the past. Shame on me for not thinking of it in this case. Giving the particular scenario I'm testing for, I think Quassnoi's solution is sufficient, but I will keep yours in mind as well.
Chris