views:

409

answers:

5

Hi,

I'm using SQL Server 2008 on Windows Server 2008 R2, all sp'd up.

I'm getting occasional issues with SQL Server hanging with the CPU usage on 100% on our live server. It seems all the wait time on SQL Sever when this happens is given to SOS_SCHEDULER_YIELD.

Here is the Stored Proc that causes the hang. I've added the "WITH (NOLOCK)" in an attempt to fix what seems to be a locking issue.

ALTER PROCEDURE [dbo].[MostPopularRead]
AS
BEGIN
SET NOCOUNT ON;

SELECT 
    c.ForeignId , ct.ContentSource as ContentSource
    , sum(ch.HitCount * hw.Weight) as Popularity
    , (sum(ch.HitCount * hw.Weight) * 100) / @Total as Percent
    , @Total as TotalHits
from 
    ContentHit ch WITH (NOLOCK)
    join [Content] c WITH (NOLOCK) on ch.ContentId = c.ContentId
    join HitWeight hw WITH (NOLOCK) on ch.HitWeightId = hw.HitWeightId
    join ContentType ct WITH (NOLOCK) on c.ContentTypeId = ct.ContentTypeId
where 
    ch.CreatedDate between @Then and @Now
group by
    c.ForeignId , ct.ContentSource
order by
    sum(ch.HitCount * hw.HitWeightMultiplier) desc
END

The stored proc reads from the table "ContentHit", which is a table that tracks when content on the site is clicked (it gets hit quite frequently - anything from 4 to 20 hits a minute). So its pretty clear that this table is the source of the problem. There is a stored proc that is called to add hit tracks to the ContentHit table, its pretty trivial, it just builds up a string from the params passed in, which involves a few selects from some lookup tables, followed by the main insert:

BEGIN TRAN
insert into [ContentHit] 
    (ContentId, HitCount, HitWeightId, ContentHitComment)
values
    (@ContentId, isnull(@HitCount,1), isnull(@HitWeightId,1), @ContentHitComment)
COMMIT TRAN

The ContentHit table has a clustered index on its ID column, and I've added another index on CreatedDate since that is used in the select.

When I profile the issue, I see the Stored proc executes for exactly 30 seconds, then the SQL timeout exception occurs. If it makes a difference the web application using it is ASP.NET, and I'm using Subsonic (3) to execute these stored procs.

Can someone please advise how best I can solve this problem? I don't care about reading dirty data...

EDIT: The MostPopularRead stored proc is called very infrequently - its called on the home page of the site, but the results are cached for a day. The pattern of events that I am seeing is when I clear the cache, multiple requests come in for the home site, and they all hit the stored proc because it hasn't yet been cached. SQL Server then maxes out, and can only be resolved by restarting the sql server process. When I do this, usually the proc will execute OK (in about 200 ms) and put the data back in the cache.

EDIT 2: I've checked the execution plan, and the query looks quite sound. As I said earlier when it does run it only takes around 200ms to execute. I've added MAXDOP 1 to the select statement to force it to use only one CPU core, but I still see the issue. When I look at the wait times I see that XE_DISPATCHER_WAIT, ONDEMAND_TASK_QUEUE, BROKER_TRANSMITTER, KSOURCE_WAKEUP and BROKER_EVENTHANDLER are taking up a massive amount of wait time.

EDIT 3: I previously thought that this was related to Subsonic, our ORM, but having switched to ADO.NET, the erros is still live.

A: 

If you believe ContentHit to be the source of your problem, you could add a Covering Index

CREATE INDEX IX_CONTENTHIT_CONTENTID_HITWEIGHTID_HITCOUNT 
  ON dbo.ContentHit (ContentID, HitWeightID, HitCount)

Take a look at the Query Plan if you want to be certain about the bottleneck in your query.

Lieven
+1  A: 

The issue is likely concurrency, not locking. SOS_SCHEDULER_YIELD occurs when a task voluntarily yields the scheduler for other tasks to execute. During this wait the task is waiting for its quantum to be renewed.

How often is [MostPopularRead] SP called and how long does it take to execute? The aggregation in your query might be rather CPU-intensive, especially if there are lots of data and/or ineffective indexes. So, you might end up with high CPU pressure - basically, a demand for CPU time is too high.

I'd consider the following:

  1. Check what other queries are executing while CPU is 100% busy? Look at sys.dm_os_waiting_tasks, sys.dm_os_tasks, sys.dm_exec_requests.

  2. Look at the query plan of [MostPopularRead], try to optimize the query. Quite often an ineffective query is the root cause of a performance problem, and query optimization is much more straightforward than other performance improvement techniques.

  3. If the query plan is parallel and the query is often called by multiple clients simultaneously, forcing a single-thread plan with MAXDOP=1 hint might help (abundant use of parallel plans is usually indicated by SOS_SCHEDULER_YIELD and CXPACKET waits).

Also, have a look at this paper: Performance tuning with wait statistics. It gives a pretty good summary of different wait types and their impact on performance.

P.S. It is easier to use SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED before a query instead of adding (nolock) to each table.

VladV
I agree with this answer. Try tuning the max degree of parallelism. Ironically, I've seen situations where *lowering* the max dop can improve performance.
Sean Reilly
Ok thanks I'll start looking into the MAXDOP and finding out more about what happens when CPU is at 100%.
Matt Roberts
MostPopularRead is cached once called, so should not be called often. However, because of this problem, while the site is waiting for the MostPopularRead to execute so it can be cached, loads more requests come in and fire the stored proc - which is when SQL Server hangs.If I run the stored proc manually from management studio, it never takes more than 0.5 secs, usually around 200 ms.
Matt Roberts
A: 

Hi By default settings sql server uses all the core/cpu for all queries (max DoP setting> advanced property, DoP= Degree of Parallelism), which can lead to 100% CPU even if only one core is actually waiting for some I/O.
If you search the net or this site you will find resource explaining it better than me (like monitoring your I/o despite you see a CPU-bound problem).
On one server we couldn't change the application with a bad query that locked down all resources (CPU) but by setting DoP to the half of the number of core we managed to avoid that the server get "stopped". The effect on the queries being less parallel was negligible in our case.

--
Dom

Dom Ribaut
+1  A: 

Remove the NOLOCK hint.

Open a query in SSMS, run SET STATISTICSIO ON and run the query in the procedure. Let it finish and post here the IO stats messages. Then post the table definitions and all indexes defined on them. Then somebody will be able to reply with the proper indexes you need.

As with all SQL performance problem, the text of the query is largely irrelevant without complete schema definition.

A guesstimate covering index would be:

create index ContentHitCreatedDate 
   on ContentHit (CreatedDate) 
   include (HitCount, ContentId,  HitWeightId);

Update

XE_DISPATCHER_WAIT, ONDEMAND_TASK_QUEUE, BROKER_TRANSMITTER, KSOURCE_WAKEUP and BROKER_EVENTHANDLER: you can safely ignore all these waits. They show up because they represent threads parked and waiting to dispatch XEvents, Service Broker or internal SQL thread pool work items. As they spend most of their time parked and waiting, they get accounted for unrealistic wait times. Ignore them.

Remus Rusanu
Thanks Remus. But as I said earlier, when I run the query it runs extremely efficiently (200 ms), unless it hangs, so I wouldn't expect that a different index would solve this problem. The index you have added is the one i added to that table.I'll post this info if my Subsonic ORM tests don't reveal anything.
Matt Roberts
A: 

Thanks to all who posted, I got some great SQL Server perf tuning tips.

In the end we ran out time to resolve this mystery - we found a more effecient way to collect this information and cache it in the database, so this solved the problem for us.

Matt Roberts