Woke up this morning to a page that our cluster was down. It came back up right away. I found error log entries about I/O taking longer than 15 seconds. Our monitoring server had tried to ping the server and got a timeout error.

I checked one of our monitoring tools to see what was going on at 4:30 in the morning. It looks like statistics were being updated on one of our large databases. The tool shows our disk being maxed out, with very high % busy times for one of the disks.

Now sqlagent is progressing through every other database alphabetically doing the same thing. We do have auto update stats on, but I thought that happened on an as-needed basis. I don't have any statistics update jobs enabled right now (that I know of, and the job monitor doesn't show any running jobs), so I'm not really sure what's causing this.

http://support.microsoft.com/default.aspx?scid=kb;en-us;195565 - confirms my thoughts on the as-needed nature of autostats.

The same thing also happened last night around 6:30pm on the same large database: a few select statsman from... statements were running.

The disks are on a SAN and we're running the latest version of SQL Server 2005.

A: 

If you are getting 15-second I/O errors, I would start the diagnosis at a lower level: check whether a driver related to the I/O path has recently been updated, e.g. PowerPath, Emulex, etc. When I have encountered this error before, it was caused by a faulty I/O subsystem and was not directly a SQL Server problem; SQL Server was just the component that put the disk under load and revealed the issue.

Andrew
We're pulling 300MB/s, so we'll be looking into this.
Sam
A: 

The 15-second error is not always correct; sometimes it is caused by CPU time drift, see Event ID 833: I/O requests taking longer than 15 seconds. Validate that I/O requests are indeed taking that long (note that OS perfmon counters suffer from the same time drift issue).

Out-of-date statistics are everyone's favourite and first thing to blame for any performance problem, but in truth they are seldom the root cause. Bad statistics can be diagnosed by investigating the execution plan of problem queries: they show up as a significant discrepancy between the estimated number of rows and the actual number of rows on range and scan operators.
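
As a rough illustration of that check, here is a minimal Python sketch that flags operators whose estimated and actual row counts diverge. The operator names, the row counts, and the 10x threshold are all hypothetical; in practice the numbers would be read off the actual execution plan of a problem query.

```python
def misestimate_ratio(estimated_rows, actual_rows):
    """Return how far off the estimate is, as a factor >= 1."""
    # Clamp both values to 1 so zero-row operators do not divide by zero.
    est = max(float(estimated_rows), 1.0)
    act = max(float(actual_rows), 1.0)
    return max(est / act, act / est)


def suspicious_operators(operators, threshold=10.0):
    """Return names of operators whose estimate is off by more than threshold."""
    return [name for name, est, act in operators
            if misestimate_ratio(est, act) > threshold]


# Hypothetical (operator, estimated rows, actual rows) tuples,
# copied by hand from an actual execution plan.
plan = [
    ("Index Seek on Orders", 120, 135),
    ("Clustered Index Scan on OrderLines", 1, 250000),
    ("Index Seek on Customers", 5000, 40),
]

flagged = suspicious_operators(plan)
print(flagged)
```

The ratio is symmetric, so both underestimates and overestimates get flagged. The threshold is a judgment call, not a documented rule.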

If you believe you must do a full update statistics every night (I doubt it), then your I/O subsystem must be planned to support the capacity required. If the database is large, an update statistics with full scan will have to read the entire database once, so plan accordingly, including the I/O bandwidth from SAN to SQL.
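
To put a rough number on that planning, a back-of-the-envelope estimate divides the database size by the sustained read throughput. The sketch below is only an illustration under stated assumptions: the 2 TB database size is hypothetical, the 300 MB/s figure is the throughput mentioned in the comment on the other answer, and it assumes the scan reads every byte exactly once with no caching and no competing load.

```python
def full_scan_seconds(db_size_gb, throughput_mb_per_s):
    """Estimate seconds needed to read db_size_gb once at the given throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    # 1 GB is treated as 1024 MB here.
    return db_size_gb * 1024 / throughput_mb_per_s


# Hypothetical 2 TB database read at the 300 MB/s quoted in the thread.
seconds = full_scan_seconds(2048, 300)
print(f"{seconds / 3600:.1f} hours")
```

Under those assumptions the scan takes a little under two hours, during which the disks are saturated for everything else on the server.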

Remus Rusanu
I don't have any update stats jobs enabled. I saw the update stats commands running through every database on every table! I don't know what is causing this.
Sam