I have a deceptively simple SQL Server query that's taking a lot longer than I would expect.

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
SELECT COUNT(DISTINCT(guid)) FROM listens WHERE url='http://www.sample.com/'

'guid' is varchar(64) NULL

'url' is varchar(900) NULL

There is an index on guid and url.

There are over 7 million rows in the 'listens' table, of which 17,000 match the url in question, and the result of the query is 5,500.

It is taking over 1 minute to run this query on SQL Server 2008 on a fairly idle Dual-Core AMD Opteron 2GHz with 1GB RAM.

Any ideas how to get the execution time down? Ideally it should be under 1 second!

A: 

Your GUID column will, by nature, be a lot more labour-intensive than, say, a bigint as it takes up more space (16 bytes). Can you replace the GUID column with an auto-incremented numerical column, or failing that, introduce a new column of type bigint/int that is incremented for each new value of the GUID column (you could then use your GUID to ensure global uniqueness, and the bigint/int for indexing purposes)?
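For illustration only, a minimal sketch of that surrogate-key idea (the table and column names here are hypothetical, not from the question):

CREATE TABLE guid_map
(
    guid_id INT IDENTITY(1,1) PRIMARY KEY, -- narrow 4-byte surrogate used for indexing
    guid    varchar(64) NOT NULL UNIQUE    -- externally supplied value, kept for global uniqueness
)

-- listens would then store guid_id instead of the 64-byte string, and the count
-- becomes COUNT(DISTINCT guid_id) over a much narrower index.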

From the link above:

At 16 bytes, the uniqueidentifier data type is relatively large compared to other data types such as 4-byte integers. This means indexes built using uniqueidentifier keys may be relatively slower than implementing the indexes using an int key.

Is there any particular reason why you're using a varchar for your guid column rather than uniqueidentifier?

davek
The guid is a value that's provided from an outside source. Currently it does happen to look like a uniqueidentifier but that's not guaranteed, so it needs to be a string. I could potentially create another table mapping an int to the guid, but that would make inserts into 'listens' more expensive, and I need to keep inserts fast.
Tim Norman
A: 

Have you tried eliminating the DISTINCT keyword and getting a grouping instead?

SELECT guid, COUNT(*)
FROM listens
WHERE url = 'http://www.sample.com/'
GROUP BY guid
George
+1 I always go for group by over distinct.
Vaccano
Will that really make that much of a difference? : http://stackoverflow.com/questions/164319/is-there-any-difference-between-group-by-and-distinct
davek
Wrong code - it doesn't count the number of distinct guids; it only lists all guids and counts the number of entries for each guid.
ThinkJet
+4  A: 

Create an index on url which would cover the GUID:

CREATE INDEX ix_listens_url__guid ON listens (url) INCLUDE (guid)

When dealing with urls as identifiers, it is much better to store and index the URL hash rather than the whole URL.
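As a rough sketch of that hashing idea (the column and index names are assumptions, and CHECKSUM is just one possible hash function):

ALTER TABLE listens ADD url_hash AS CHECKSUM(url) PERSISTED
CREATE INDEX ix_listens_urlhash ON listens (url_hash) INCLUDE (guid)

-- Filter on the narrow hash first, then re-check the real url to weed out collisions
SELECT COUNT(DISTINCT guid)
FROM listens
WHERE url_hash = CHECKSUM('http://www.sample.com/')
  AND url = 'http://www.sample.com/'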

Quassnoi
Note that creating such wide indexes isn't really a good idea; they just take space and are useful only in a few cases. I agree about the url hashing, although I prefer the checksum (it's narrower and the index is faster), as I said in my answer.
Mladen Prajdic
Indexing by the URL column is enough; the only thing needed is to tell the super-duper-intelligent MS SQL Server how to build the right query plan :)
ThinkJet
The query plan is the least of your worries; it's the IO that's the problem. Huge indexes will cause huge IO.
Mladen Prajdic
Can't agree. The right query plan comes first. If the query plan is wrong you always get your "huge IO" as a result of a full table scan without using indexes, full index scans, joins through non-required tables, etc.
ThinkJet
Indexes must provide good data granularity. Quickly locating 17,000 records out of 7 million is good enough. Hashing, caching and other search-algorithm optimizations are part of the work done by SQL Server.
ThinkJet
+1  A: 

Scanning indexes that large will take a long time no matter what.
What you need to do is shorten the indexes.
What you can do is have an integer column where the checksum of the url is calculated and stored. This way your index will be narrow and the count will be fast.

Note that the checksum is not unique, but it's unique enough. Here's a complete code example of how to do it. I've included checksums for both columns, but it probably needs only one. You could also calculate the checksum on the insert or update yourself and remove the trigger.

CREATE TABLE MyTable
(
    ID INT IDENTITY(1,1) PRIMARY KEY,
    [Guid] varchar(64),
    Url varchar(900),
    GuidChecksum int,
    UrlChecksum int
)
GO

CREATE TRIGGER trgMyTableCheckSumCalculation ON MyTable
FOR UPDATE, INSERT
as
UPDATE t1
SET    GuidChecksum = checksum(I.[Guid]),
       UrlChecksum = checksum(I.Url)
FROM   MyTable t1 
       join inserted I on t1.ID = I.ID

GO
CREATE NONCLUSTERED INDEX NCI_MyTable_GuidChecksum ON MyTable(GuidChecksum)
CREATE NONCLUSTERED INDEX NCI_MyTable_UrlChecksum ON MyTable(UrlChecksum)

INSERT INTO MyTable([Guid], Url)
select NEWID(), 'my url 1' union all
select NEWID(), 'my url 2' union all
select null, 'my url 3' union all
select null, 'my url 4'

SELECT  *
FROM    MyTable

SELECT  COUNT(GuidChecksum)
FROM    MyTable
WHERE   Url = 'my url 3'
GO

DROP TABLE MyTable
Mladen Prajdic
+1 if you add an example of what the select would look like in this scenario. (where url_crc = crc('url') and url = 'url') or something like that.
Lieven
Hashing (called a "checksum" here) is not an answer by itself because it's not unique, and the real value of the `url` field MUST still be tested against the given value. Therefore SQL Server MUST read the real value of the field.
ThinkJet
-1 At least the `select count() ...` query is wrong: 1) the real distinct guids must be counted, not non-unique checksums; 2) UrlChecksum must be added to the WHERE clause, otherwise the server has no reason to use the index on UrlChecksum.
ThinkJet
You do realize that this is an example of the concept he should use, not the actual solution?
Mladen Prajdic
Yes, I realize that. But the form of the query is critical for the question asked.
ThinkJet
@ThinkJet-comment2: I think the idea here is to use the checksum to get a subset of all results (inclusive of the needed results) and then use the actual value to refine those results. Once you use the checksum to reduce the 7 million rows to something close to the 17k at issue, SQL Server will be faster... 7 million is a lot to do a big key search on.
Hogan
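For what it's worth, a sketch of the query shape Lieven and Hogan are describing, against the example table above (not part of the original answer): seek on the narrow checksum first, re-check the real Url to discard collisions, and count the real distinct guids.

SELECT COUNT(DISTINCT [Guid])
FROM   MyTable
WHERE  UrlChecksum = CHECKSUM('my url 3') -- narrow indexed predicate
  AND  Url = 'my url 3'                   -- re-check the real value to eliminate checksum collisions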
A: 

Some hints ...

1) Refactor your query, e.g. use a WITH clause ...

    with url_entries as (  
      select guid   
      from listens   
      where url='http://www.sample.com/'  
    )   
    select count(distinct(entries.guid)) as distinct_guid_count
    from url_entries entries

2) Tell SQL Server exactly which index must be used when performing the query (of course, the index on the url field). Another way: simply drop the index on guid and leave only the index on url. Look here for more information about hints, especially constructions like select ... from listens with (index(index_name_for_url_field)); see the sketch after this list.

3) Verify the state of the indexes on the listens table and update the index statistics.
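A minimal sketch of hints (2) and (3), reusing the placeholder index name from above:

SELECT COUNT(DISTINCT guid)
FROM listens WITH (INDEX(index_name_for_url_field))
WHERE url = 'http://www.sample.com/'

UPDATE STATISTICS listens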

ThinkJet
A: 

I bet it would perform better if you had more than 1GB of memory in the machine (all the DBAs I've met expect at least 4GB in a production SQL Server).

I've no idea if this matters but if you do a

  SELECT DISTINCT(guid) FROM listens WHERE url='http://www.sample.com/'

won't @@ROWCOUNT contain the result you want?
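Presumably something like this is meant (a sketch of the idea, not from the original answer):

SELECT DISTINCT guid FROM listens WHERE url='http://www.sample.com/'
SELECT @@ROWCOUNT AS distinct_guid_count -- rows returned by the previous statement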

Hogan
A: 

Your best possible plan is a range seek to obtain the 17k candidate rows, with the count distinct relying on a guaranteed input order so it does not have to sort. The proper data structure that can satisfy both of these requirements is an index on (url, guid):

CREATE INDEX idxListensURLGuid on listens(url, guid);

You already got plenty of feedback on the width of the keys used, and you can definitely seek to improve them; also increase that puny 1GB of RAM if you can.

If it is possible to deploy on SQL 2008 Enterprise Edition, make sure you turn on page compression for such a highly repetitive and wide index. It will work miracles on performance due to reduced IO.
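A sketch of what that might look like, assuming the index above already exists (Enterprise Edition only):

CREATE INDEX idxListensURLGuid ON listens(url, guid)
    WITH (DATA_COMPRESSION = PAGE, DROP_EXISTING = ON)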

Remus Rusanu