views:

221

answers:

1

I am responsible for a third-party application (no access to source) running on IIS and SQL Server 2005 (500 concurrent users, 1TB data, 8 IIS servers). We have recently started to see significant blocking on the database (after months of running this application in production with no problems). This occurs at random intervals during the day, approximately every 30 minutes, and affects between 20 and 100 sessions each time. All of the sessions eventually hit the application time out and the sessions abort.

The problem disappears and then gradually re-emerges. The SPID responsible for the blocking always has the following features:

  • WAIT TYPE = ASYNC_NETWORK_IO
  • The SQL being run is “(@claimid varchar(15))SELECT claimid, enrollid, status, orgclaimid, resubclaimid, primaryclaimid FROM claim WHERE primaryclaimid = @claimid AND primaryclaimid <> claimid)”. This is relatively innocuous SQL that should only return one or two records, not a large dataset.
  • NO OTHER SQL statements have been implicated in the blocking, only this SQL statement.
  • This is parameterized SQL for which an execution plan is cached in sys.dm_exec_cached_plans.
  • This SPID has an object-level S lock on the claim table, so all UPDATEs/INSERTs to the claim table are also blocked.
  • HOST ID varies. Different web servers are responsible for the blocking sessions. E.g., sometimes we trace back to web server 1, sometimes web server 2.

When we trace back to the web server implicated in the blocking, we see the following:

  • There is always some sort of application related error in the Event Log on the web server, linked to the Host ID and Host Process ID from the SQL Session.
  • The error messages vary, usually some sort of SystemOutofMemory. (These error messages seem to be similar to error messages that we have seen in the past without such dramatic consequences. We think was happening before, but didn’t lead to blocking. Why now?)
  • No known problems with the network adapters on either the web servers or the SQL server.

(In any event the record set returned by the offending query would be small.)

Things ruled out:

  • Indexes are regularly defragmented.
  • Statistics regularly updated.
  • Increased sample size of statistics on claim.primaryclaimid.
  • Forced recompilation of the cached execution plan.
  • Created a compound index with primaryclaimid, claimid.
  • No networking problems.
  • No known issues on the web server.
  • No changes to application software on web servers.

We hypothesize that the chain of events goes something like this:

  1. Web server process submits SQL above.
  2. SQL server executes the SQL, during which it acquires a lock on the claim table.
  3. Web server process gets an error and dies.
  4. SQL server session is hung waiting for the web server process to read the data set.
  5. SQL Server sessions that need to get X locks on parts of the claim table (anyone processing claims) are blocked by the lock on the claim table and remain blocked until they all hit the application time out.

Any suggestions for troubleshooting while waiting for the vendor's assistance would be most welcome.

Is there a way to force SQL Server to lock at the row/page level for this particular SQL statement only? Is there a way to set a threshold on ASYNC_NETWORK_IO waits only?

+1  A: 

ASYNC_NETWORK_IO is caused by clients not able to receive data quick enough and filling network buffers (simply put). There is no magic SQL Server setting to fix it.

  • reboot the client (even if it's web server)
  • ensure NICs are set correctly (firmware, full duplex etc)
  • ensure physical cables are ok (any packet losses etc?)
  • etc

It is not a SQL Server issue, as such...

ASYNC_NETWORK_IO Occurs on network writes when the task is blocked behind the network. Verify that the client is processing data from the server.

gbn
Thanks for the quick and informative response. We rechecked the adapters/physical network connections on all of the web server and believe that we can rule this out. The SQL statement that is implicated in the blocking will normally return a very small dataset (max 3 records), not enough to overflow network buffers and produce a prolonged ASYNC_NETWORK_IO wait.However, there is a boundary condition (@claimid = '') which would return millions of records. This could very well induce ASYNC_NETWORK_IO, even on a properly configured web server. This is what we will pursue next.
ivankolo