views:

221

answers:

9

I can't figure this one out. On SQL Server I have a process that is run dozens of times per second (data being sent to the server). The process runs great, processing requests takes between 50ms and 200ms. Then, roughly (but sporadically) once every 1.5 minutes all requests suddenly take 15000ms to 22000ms (15 to 22 seconds). At the same time CPU usage on the server drops drastically. Sometimes (about 70% of the time) average disk queue length spikes just before the cpu drops and the requests slow down.

I am watching CPU on perfmon, it usually jumps between 20% and 70%, with average CPU of about 50%. When things stop, it drops to 0%, with a couple of 20% spikes, for about 20 seconds.

At the same time I am watching SQL activity monitor. Usually there are between 1 and 4 EXECUTE transactions listed, but when this happens the EXECUTE transactions start to go up, to 20 or 30 transactions. Transactions are coming in, but not being processes.

I check for blocks and never see any:

Select A.*
       From master.dbo.sysprocesses as A with (nolock)
       Where A.blocked <> 0

Note that I am running in "snapshot isolation"

I have the system recording deadlock conditions to the error log, none reported.

I checked SQL agent for other processes that are running, none scheduled at the time these events happen.

I watch SQL Profiler for other events coming in, there was nothing. I also watched for File Growth events and it reports nothing.

Even when the requests are taking 20000ms, SQL Profiler reports reads under 2000 and cpu under 50. The processes themselves do not appear to be consuming resources. However the logout events are reporting high reads and cpu (I am not sure if that is at all relevant).

There is also nothing in my event log at the time of these events.

Any ideas? Any other place to look?

Running SQL Server 2005 Standard on Window 2003 32bit.

A: 

Are you using fulltext search?

I'm thinking that there may be some index rebuilding going on every now and then.

Perhaps try automating full rebuilds of indexes, or changing to non clustered indexes?

rizzle
Thanks Rizzle. However, I'm not using full text search.
Mike Brad
A: 

I would add in a few more counters in your perfmon, like maybe reads and writes per second. From here you can see if it's an I/O issue. Also check out this MSDN entry on SQL performance. It really gave some good ideas on things to check out for me at least.

Tim Meers
I guess I am in bad shape. % Disk Avg 633 (can't explain that). Avg disk sec/read .042 Avg disk sec/write .052disk reads/sec 2.041disk writes/sec 71It is a parity raid, but I think those numbers are out of the ballpark. would you agree?
Mike Brad
Well not knowing your RAID level and number of disks it's hard to say if disk IO is the issue. I have a RAID 5 array with 4 disks so I'd use this to calculate the IOPS: reads + (4 * Writes)) / Number of disks = total IO/s. Under my typical load with numbers punched it it looks like this: (724.364 + (4 * 5.707)) / 4 = 186.798. I have a lot more reads than writes, but you seem to have a lot of writes, but nothings to terrible, like Chris said might be an issue with the array. I'd check that before spending any time on the code.
Tim Meers
But then again, I normally look at the hardware first since I'm better at the server side of things than coding.
Tim Meers
+2  A: 

Have you checked the drive for errors? It sounds like maybe there is something going on. If it's a RAID array, check the health of the array.

Chris Lively
Will do (I will put the ISM on it). Thanks.
Mike Brad
A: 

what is the wait_type, wait_resource and wait_time of sys.dm_exec_requests for the long running requests (sample periodically)? Do these requests spawn sub tasks (sys.dm_os_tasks)? What are those tasks doing?

Remus Rusanu
Generally for processes that don't look like system processes, waittype is null and waittime is 0. During one of the incidents I queried dm_exec_requests and did see one transaction with OLEDB (waittime 15) and one with waittype WRITELOG (waittime 0). I will have to research what this means.Not sure what to look for in dm_os_tasks
Mike Brad
WRITELOG means the requests has commited a transaction and is waiting for the log to be hardened (written to disk). OLEDB is a distributed query wait. In sys.dm_os_tasks you should look for task_state. PENDING would indicate a scheduler bottleneck (all workers are occupied)
Remus Rusanu
Mike Brad
A: 

Have you checked your memory consumption? Windows Server 2003 R2 sometimes basically restarts all memory allocations under intense load. When this happens, SQL Server is forced down to a minimal amount of memory (4MB or so) and then slowly reallocate memory to the server until it returns to relatively normal levels. We've seen this happen when very large files are copied across our SAN. I've heard this can be triggered by a transaction log backup process if the transaction logs are very large and the server is under extremely heavy usage.

Registered User
Looking at task manager (not sure that is the best way) I see the Sqlservr.exe process reporting about 2,544,000 mem usage. It fluctuates a little but never drops considerably (even through an incident).
Mike Brad
A: 

Mike, See my blog post Unexplained SQL Server Timeouts and Intermittent Blocking. Especially if your stored proc has a "SELECT INTO" or deletes from a temp table.

Jim

JBrooks
As a rule we use table variables (not temp tables) which are defined before any data is inserted into them. I will sift through the whole process and check again to be sure.
Mike Brad
A: 

It's not slow code because the delay doesn't increase the CPU time. It sounds like the server is making a blocking call that is not succeeding, and then it eventually times out. You've ruled out deadlocks. If it was a hard drive problem, you'd expect to see something in the event log.

Try installing a network sniffer such as Wireshark to see if there is anything interesting happening at the time the pause begins.

jdigital
A: 

One option: statistics update. If you're writing often enough, you may hit the recompute threshold.

Look at this article "Index Statistics on MSDN" and the option "AUTO_UPDATE_STATISTICS_ASYNC"

Although every 90 seconds is a bit much...

gbn
+1  A: 

The issue is the automatic Checkpoint. When SQL server runs the automatic checkpoint, other transactions are delayed, this is probably related to the disk i/o involved in the checkpoint.

dm_exec_requests showing a waittype WRITELOG (waittime 0) means the requests has committed a transaction and is waiting for the log to be hardened (written to disk) --Remus Rusanu

To verify this, I turned on checkpoint logging, and recorded a perfmon session during several of the incidents. I then compared the log to the perfmon to see that the incidents were always related to checkpoint in one of my databases.

DBCC TRACEON(3502, -1) --turn on checkpoint logging

DBCC TRACEOFF(3502, -1) --turn off checkpoint logging

EXEC xp_readerrorlog --read the log

SELECT DB_Name([dbid]) as [Database Name] --verify the database id mentioned in the log

That particular database has one process that produces a lot of inserts and deletes. The solution is to re-write that process to reduce the amount of data being recorded. Another option would be to add hardware.

Thanks to all who contributed.

Mike Brad