Hello,

I have a table with 20+ million records.

Structure is like:

EventId UNIQUEIDENTIFIER
SourceUserId UNIQUEIDENTIFIER
DestinationUserId UNIQUEIDENTIFIER
CreatedAt DATETIME
TypeId INT
MetaId INT

The table receives 100k+ new records each day.

I have indexes on each column except MetaId, as it is not used in WHERE clauses.

The problem arises when I want to pick up, e.g., the latest 100 records for a given SourceUserId.

The query sometimes takes up to 4 minutes to execute, which is not acceptable.

Eg.

SELECT TOP 100 * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND
(
    TypeId IN (2, 3, 4)
    OR
    (TypeId = 60 AND SrcMemberId != DstMemberId)
)
ORDER BY CreatedAt DESC

I can't do partitioning etc. as I am using the Standard edition of SQL Server, and Enterprise is too expensive.

I also think the table is too small to be this slow.

I think the problem is with the ORDER BY clause, as the database must go through a much bigger set of data.

Any ideas on how to make it quicker?

Perhaps a relational database is not a good fit for this kind of data.

The data is always picked up ordered by CreatedAt DESC.

Thank you for reading.

PabloX

A: 

I would make sure CreatedAt is indexed properly

Jim B
It would be nice to benchmark the query with and without the `Order By CreatedAt` clause!
p.campbell
An index solely on CreatedAt wouldn't likely help this type of query as well as a composite index leading on the equality filter SourceUserId. An index solely on CreatedAt would simply let the optimizer scan backwards down the B-tree, performing a residual filter check on SourceUserId until it found 100 rows with the specified value; if SourceUserId is very selective, that would likely mean a fairly long scan. With a composite index on SourceUserId then CreatedAt, the optimizer can skip the scan, seek to the given SourceUserId, and simply pull out the pre-ordered final 100.
chadhoc
+12  A: 

You'll likely want to create a composite index for this type of query. When the query runs slowly, the optimizer is most likely choosing to scan down an index on the CreatedAt column and perform a residual filter on the SourceUserId value, when in reality what you want is to jump directly to all records for a given SourceUserId, already in the right order. To achieve this, create a composite index primarily on SourceUserId (for the equality check) and secondarily on CreatedAt (to preserve the order within a given SourceUserId value). You may want to try adding TypeId as well, depending on the selectivity of that column.

So, the two that will most likely give the best repeatable performance (try them out and compare) are:

  1. Index on (SourceUserId, CreatedAt)
  2. Index on (SourceUserId, TypeId, CreatedAt)
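A minimal sketch of the two candidate indexes in concrete T-SQL, assuming the table is named Events (the index names are illustrative, not from the question):

```sql
-- Option 1: equality seek on SourceUserId, rows pre-ordered by CreatedAt
CREATE NONCLUSTERED INDEX IX_Events_SourceUserId_CreatedAt
    ON dbo.Events (SourceUserId, CreatedAt);

-- Option 2: also narrows on TypeId before the date ordering
CREATE NONCLUSTERED INDEX IX_Events_SourceUserId_TypeId_CreatedAt
    ON dbo.Events (SourceUserId, TypeId, CreatedAt);
```

With either index, the ORDER BY CreatedAt DESC can be satisfied by a backward range scan within the SourceUserId partition of the B-tree rather than a sort.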

As always, there are many other considerations in determining how/what/where to index. As Remus discusses in a separate answer, one big consideration is covering the query vs. keeping lookups; additionally, you'll need to consider write volumes, possible fragmentation impact (if any), singleton lookups vs. large sequential scans, etc.

chadhoc
+1: and `covering index` in SQL Server terminology.
OMG Ponies
Yep, as with so many SQL questions, the answer is not some magic SELECT trick but good old-fashioned index design.
Remus Rusanu
Clustering this index and having a striped disk would help as well.
Mark Canlas
Thank you very much, that sped things up significantly :) I should learn more about indexes :)
pablox
A: 

You could split the query in two with a UNION to avoid the OR (which can prevent your index from being used), something like:

SELECT TOP 100 * FROM (
    -- each branch takes its own TOP 100 (with ORDER BY), so the outer TOP 100 is correct
    SELECT * FROM (
        SELECT TOP 100 * FROM Events WITH (NOLOCK)
        WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
        AND TypeId IN (2, 3, 4)
        ORDER BY CreatedAt DESC
    ) AS a
    UNION ALL
    SELECT * FROM (
        SELECT TOP 100 * FROM Events WITH (NOLOCK)
        WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
        AND TypeId = 60 AND SrcMemberId != DstMemberId
        ORDER BY CreatedAt DESC
    ) AS b
) AS x
ORDER BY CreatedAt DESC

Also, check that the uniqueidentifier indexes are not CLUSTERED.
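A quick way to verify which index (if any) is clustered is to query the system catalog. A sketch, assuming the table is named dbo.Events:

```sql
-- Lists each index on Events with its type (CLUSTERED / NONCLUSTERED / HEAP)
SELECT i.name, i.type_desc, i.is_primary_key
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID('dbo.Events');
```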

vladhorby
+1  A: 

I would recommend getting the data into two separate table variables:

INSERT INTO @Table1
SELECT * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND TypeId IN (2, 3, 4)

INSERT INTO @Table2
SELECT * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND TypeId = 60 AND SrcMemberId != DstMemberId

Then apply a UNION across the two selects, with ORDER BY and TOP. Limit the data from the get-go.
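A fuller sketch of this approach. The table-variable declarations are assumptions (a table variable needs an explicit column list), and SrcMemberId/DstMemberId from the posted query are assumed to correspond to the SourceUserId/DestinationUserId columns in the stated structure:

```sql
DECLARE @Table1 TABLE (
    EventId UNIQUEIDENTIFIER, SourceUserId UNIQUEIDENTIFIER,
    DestinationUserId UNIQUEIDENTIFIER, CreatedAt DATETIME,
    TypeId INT, MetaId INT);
DECLARE @Table2 TABLE (
    EventId UNIQUEIDENTIFIER, SourceUserId UNIQUEIDENTIFIER,
    DestinationUserId UNIQUEIDENTIFIER, CreatedAt DATETIME,
    TypeId INT, MetaId INT);

-- Limit each branch up front: only the newest 100 per branch are kept
INSERT INTO @Table1
SELECT TOP 100 * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
  AND TypeId IN (2, 3, 4)
ORDER BY CreatedAt DESC;

INSERT INTO @Table2
SELECT TOP 100 * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
  AND TypeId = 60 AND SourceUserId != DestinationUserId
ORDER BY CreatedAt DESC;

-- Combine the two small sets and take the overall newest 100
SELECT TOP 100 * FROM (
    SELECT * FROM @Table1
    UNION ALL
    SELECT * FROM @Table2
) AS u
ORDER BY CreatedAt DESC;
```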

astander
+1  A: 

I suggest using a UNION:

SELECT TOP 100 x.*
  FROM (SELECT a.*
          FROM EVENTS a
         WHERE a.typeid IN (2, 3, 4)
        UNION ALL
        SELECT b.*
          FROM EVENTS b
         WHERE b.typeid = 60 
           AND b.srcmemberid != b.dstmemberid) x
 WHERE x.sourceuserid = '15b534b17-5a5a-415a-9fc0-7565199c3461'
 ORDER BY x.createdat DESC
OMG Ponies
+4  A: 

I have indexes on each column except MetaId

Non-covering indexes will likely hit the 'tipping point' and the query will revert to a table scan. Just adding an index on every column used in a WHERE clause does not equate to good index design. Taking your query as an example, a good 100% covering index would be:

INDEX ON (SourceUserId, CreatedAt) INCLUDE (TypeId, SrcMemberId, DstMemberId)

The following index is also useful, although it is still going to cause lookups:

INDEX ON (SourceUserId, CreatedAt) INCLUDE (TypeId)

And finally, an index without any included columns may help, but it is just as likely to be ignored (it depends on the column statistics and cardinality estimates):

INDEX ON (SourceUserId, CreatedAt)

But a separate index on SourceUserId and one on CreatedAt is basically useless for your query.
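The covering index above, written out as concrete T-SQL (the index name is illustrative, and the column names follow the query as posted):

```sql
-- Key columns drive the seek and the ordering; INCLUDE columns
-- let the index answer the query without key lookups
CREATE NONCLUSTERED INDEX IX_Events_SourceUserId_CreatedAt_Covering
    ON dbo.Events (SourceUserId, CreatedAt)
    INCLUDE (TypeId, SrcMemberId, DstMemberId);
```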

See Index Design Basics.

Remus Rusanu
Thank you! That helped me :)
pablox
A: 

If 100K records are added each day, you should check your index fragmentation, and rebuild or reorganize accordingly. More info: SQLAuthority
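A sketch of checking and fixing fragmentation, assuming the table is named dbo.Events (the 30% rebuild threshold is a common rule of thumb, not a hard rule):

```sql
-- Average fragmentation per index on the Events table
SELECT i.name, s.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.Events'), NULL, NULL, 'LIMITED') AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id;

-- Reorganize for light fragmentation; rebuild for heavy (e.g. > 30%)
ALTER INDEX ALL ON dbo.Events REORGANIZE;
-- ALTER INDEX ALL ON dbo.Events REBUILD;
```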

Leon
Yep, I am doing this as well.
pablox
+4  A: 

The fact that the table has indexes built on GUID values indicates a series of possible problems that can affect performance:

  • High index fragmentation: since new GUIDs are generated randomly, the index cannot organize them in a sequential order and the nodes are spread unevenly.
  • High number of page splits: the size of a GUID (16 bytes) causes many page splits in the index, since there's a greater chance that a new value won't fit in the remaining space available in a page.
  • Slow value comparison: comparing two GUIDs is a relatively slow operation because all 16 bytes must be compared.
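One common mitigation for the fragmentation and page-split problems (a sketch, not part of the answer above) is to have the server generate sequential GUIDs so new rows append to the end of the index rather than landing on random pages:

```sql
-- NEWSEQUENTIALID() produces monotonically increasing GUIDs,
-- reducing page splits and fragmentation compared to NEWID()
CREATE TABLE dbo.EventsSeq (
    EventId UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_EventsSeq_EventId DEFAULT NEWSEQUENTIALID(),
    CreatedAt DATETIME NOT NULL,
    CONSTRAINT PK_EventsSeq PRIMARY KEY CLUSTERED (EventId)
);
```

Note that NEWSEQUENTIALID() can only be used as a column default, and the generated values are predictable, so it is unsuitable where GUIDs must be unguessable.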

Here are a couple of resources on how to investigate and resolve these problems:

Enrico Campidoglio
A: 

We've realised a minor gain by moving to a BIGINT IDENTITY key for our event table; by using that as a clustered primary key, we can cheat and use that for date ordering.
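A sketch of that design (table and column names are illustrative): because the IDENTITY value increases monotonically with insert order, ordering by the clustered key approximates ordering by CreatedAt without touching the date column:

```sql
CREATE TABLE dbo.Events2 (
    EventSeq     BIGINT IDENTITY(1, 1) NOT NULL,  -- clustered surrogate key
    EventId      UNIQUEIDENTIFIER NOT NULL,
    SourceUserId UNIQUEIDENTIFIER NOT NULL,
    CreatedAt    DATETIME NOT NULL,
    CONSTRAINT PK_Events2 PRIMARY KEY CLUSTERED (EventSeq)
);

-- Ordering by the clustered key stands in for ORDER BY CreatedAt DESC
DECLARE @SourceUserId UNIQUEIDENTIFIER = NEWID();  -- placeholder parameter
SELECT TOP 100 * FROM dbo.Events2
WHERE SourceUserId = @SourceUserId
ORDER BY EventSeq DESC;
```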

Keith Williams