A client has reported repeated instances of very strange behaviour when executing a stored procedure.

They have code which runs off a cached transposition of a volatile dataset. A stored proc was written to reprocess the dataset on demand if:
1. The dataset had changed since the last reprocessing
2. The dataset had been unchanged for 5 minutes

(The second condition stops massive repeated recalculation during times of change.)


This worked fine for a couple of weeks; the SP took 1-2 seconds to complete the reprocessing, and it only did so when required. Then...

  • The SP suddenly "stopped working" (it just kept running and never returned)
  • We changed the SP in a subtle way and it worked again
  • A few days later it stopped working again
  • Someone then said "we've seen this before, just recompile the SP"
  • With no change to the code we recompiled the SP, and it worked
  • A few days later it stopped working again


This has now repeated many, many times. The SP suddenly "stops working", never returning, and the client times out. (We tried running it through Management Studio and cancelled the query after 15 minutes.)

Yet every time we recompile the SP, it suddenly works again.

I haven't yet tried WITH RECOMPILE on the appropriate EXEC statements, but I don't particularly want to do that anyway. It gets called hundreds of times an hour and normally does nothing (it only reprocesses the data a few times a day). If possible I want to avoid the overhead of recompiling what is a relatively complicated SP just to avoid something which "shouldn't" happen...


  • Has anyone experienced this before?
  • Does anyone have any suggestions on how to overcome it?


Cheers,
Dems.


EDIT:

The pseudo-code would be as follows (a rough T-SQL sketch appears after the list):

  • read "a" from table_x
  • read "b" from table_x
  • If (a < b) return
  • BEGIN TRANSACTION
  • DELETE table_y
  • INSERT INTO table_y <3 selects unioned together>
  • UPDATE table_x
  • COMMIT TRANSACTION
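
Roughly, in T-SQL (with made-up names for the columns and the source tables) the SP looks something like:

-- Minimal sketch only; the real table/column names differ.
DECLARE @a DATETIME, @b DATETIME

SELECT @a = last_change,          -- read "a" from table_x
       @b = last_processed       -- read "b" from table_x
FROM   table_x

IF (@a < @b)
    RETURN                       -- nothing to reprocess yet

BEGIN TRANSACTION

    DELETE table_y               -- clear the cached transposition

    INSERT INTO table_y (key_col, val_col)
    SELECT key_col, val_col FROM source_1
    UNION ALL
    SELECT key_col, val_col FROM source_2
    UNION ALL
    SELECT key_col, val_col FROM source_3

    UPDATE table_x
    SET    last_processed = GETDATE()   -- meta-data control table

COMMIT TRANSACTION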

The selects are "not pretty", but when executed inline they complete in no time, even when the SP refuses to complete. And the profiler shows it is the INSERT at which the SP "stalls".

There are no parameters to the SP, and sp_lock shows nothing blocking the process.

+2  A: 

This is the footprint of parameter-sniffing. Yes, the first step is to try RECOMPILE, though it doesn't always work the way that you want it to on 2005.

Update: I would try statement-level RECOMPILE on the INSERT anyway, as this might be a statistics problem (oh yeah, check that automatic statistics updating is on).
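
For example, a statement-level hint on an INSERT of that shape (names here are placeholders) recompiles just that statement rather than the whole SP:

INSERT INTO table_y (key_col, val_col)
SELECT key_col, val_col FROM source_1
UNION ALL
SELECT key_col, val_col FROM source_2
UNION ALL
SELECT key_col, val_col FROM source_3
OPTION (RECOMPILE)   -- fresh plan for this statement only, based on current statistics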

If this does not seem to fit parameter-sniffing, then compare the actual query plan from when it works correctly and from when it is running forever (use the estimated plan if you cannot get the actual, though actual is better). You are looking to see whether the plan changes or not.
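
One way to capture both plans from Management Studio (the procedure name is a placeholder):

-- Estimated plan: returns the plan XML without executing the procedure
SET SHOWPLAN_XML ON
GO
EXEC dbo.usp_ReprocessDataset
GO
SET SHOWPLAN_XML OFF
GO

-- Actual plan: executes the procedure and returns the plan XML with run-time counts
SET STATISTICS XML ON
GO
EXEC dbo.usp_ReprocessDataset
GO
SET STATISTICS XML OFF
GO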

RBarryYoung
Unfortunately, there are no parameters. The only variable is the content of the source dataset being processed.
Dems
A: 

I totally agree with the parameter sniffing diagnosis. If you have input parameters to the SP which are varying (or even if they aren't varying) - be sure to mask them with a local variable and use the local variable in the SP.
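
A generic sketch of that masking (the procedure, parameter and table names are made up, since your SP apparently takes no parameters):

CREATE PROCEDURE dbo.usp_Example
    @CustomerId INT
AS
BEGIN
    -- Copy the parameter into a local variable so the optimiser
    -- cannot "sniff" the value passed on first execution.
    DECLARE @LocalCustomerId INT
    SET @LocalCustomerId = @CustomerId

    SELECT OrderId, OrderDate
    FROM   dbo.Orders
    WHERE  CustomerId = @LocalCustomerId
END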

You can also use WITH RECOMPILE if the set is changing but the query plan is no longer any good.

In SQL Server 2008, you can use the OPTIMIZE FOR UNKNOWN feature.
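
For example (again with made-up names), the hint makes the optimiser plan for an "average" value rather than a sniffed one:

SELECT OrderId, OrderDate
FROM   dbo.Orders
WHERE  CustomerId = @CustomerId
OPTION (OPTIMIZE FOR UNKNOWN)   -- SQL Server 2008 and later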

Also, if your process involves populating a table and then using that table in another operation, I recommend breaking the process up into separate SPs and calling them individually WITH RECOMPILE. I think the plans generated at the outset of the process can sometimes be so poor as not to complete when you populate a table and then use its results to carry out an operation, because at the time the initial plan was generated the table looked very different from how it does after the insert.
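
A sketch of that split, with made-up procedure names; each call then gets a plan compiled against the table as it actually looks at that point:

EXEC dbo.usp_PopulateWorkTable WITH RECOMPILE   -- fills the intermediate table
EXEC dbo.usp_ProcessWorkTable  WITH RECOMPILE   -- plan built after the table is populated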

Cade Roux
No parameters, and the changes in the dataset are subtle, not significant. The execution plan shouldn't need to change at all, which makes it so confusing. It's as if the execution plan changed when it shouldn't, like corrupt statistics. But we checked those too! *sigh*
Dems
I've added some notes about long processes that use multiple tables for intermediate results.
Cade Roux
In this case there are no intermediate steps. Just a DELETE then an INSERT; the UPDATE referred to in the pseudo-code is just entering GetDate() into a meta-data control table.
Dems
A: 

Obviously changing the stored procedure (by recompiling) changes the circumstances that led to the lock.

Try to log the progress of your SP as described here or here.
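
A rough sketch of the idea (the log table name is made up): record a timestamp before each step, or flush progress messages immediately with RAISERROR ... WITH NOWAIT:

-- Option 1: timestamped rows in a log table
INSERT INTO dbo.sp_progress_log (step_name, logged_at)
VALUES ('before INSERT INTO table_y', GETDATE())

-- Option 2: messages that appear immediately instead of at the end of the batch
RAISERROR ('before INSERT INTO table_y', 0, 1) WITH NOWAIT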

devio
A: 

If you are doing these steps anyway:

DELETE table_y
INSERT INTO table_y <3 selects unioned together>

You might want to try this instead:

DROP TABLE table_y
SELECT INTO table_y <3 selects unioned together>
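
A slightly fuller sketch of that form; with a UNION the INTO clause goes on the first SELECT (column and source names are made up):

IF OBJECT_ID('dbo.table_y') IS NOT NULL
    DROP TABLE dbo.table_y

SELECT key_col, val_col
INTO   dbo.table_y              -- table is recreated by SELECT ... INTO
FROM   source_1
UNION ALL
SELECT key_col, val_col FROM source_2
UNION ALL
SELECT key_col, val_col FROM source_3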

Raj
I think you mean TRUNCATE TABLE rather than DROP TABLE. Also, the security context invoking the stored procedure is unable to TRUNCATE the table, only DELETE works. Also, it's the INSERT that is the issue, not the clearing of the data.
Dems
But the fact that the data is changing mid-batch means that the execution plan chosen at the outset of the batch might be poor, hence my point about getting better execution plans piecewise.
Cade Roux
A: 

As others have said, something about the way the data or the source table statistics are changing is causing the cached query plan to go stale.

WITH RECOMPILE will probably be the quickest fix - use SET STATISTICS TIME ON to find out what the recompilation cost actually is before dismissing it out of hand.
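
For example (the procedure name is a placeholder); the "SQL Server parse and compile time" figures in the Messages tab show what the recompile actually costs:

SET STATISTICS TIME ON
EXEC dbo.usp_ReprocessDataset WITH RECOMPILE
SET STATISTICS TIME OFF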

If that's still not an acceptable solution, the best option is probably to try to refactor the insert statement.

You don't say whether you're using UNION or UNION ALL in your insert statement. I've seen INSERT INTO with UNION produce some bizarre query plans, particularly on pre-SP2 versions of SQL 2005.

  • Raj's suggestion of dropping and recreating the target table with SELECT INTO is one way to go.

  • You could also try selecting each of the three source queries into their own temporary table, then UNION those temp tables together in the insert.

  • Alternatively, you could try a combination of these suggestions - put the results of the union into a temporary table with SELECT INTO, then insert from that into the target table.

I've seen all of these approaches resolve performance problems in similar scenarios; testing will reveal which gives the best results with the data you have.
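
As a rough sketch of the last combination (names are made up): stage the unioned result first, then load the target table from the staging table:

SELECT key_col, val_col
INTO   #staging
FROM   source_1
UNION ALL
SELECT key_col, val_col FROM source_2
UNION ALL
SELECT key_col, val_col FROM source_3

DELETE table_y

INSERT INTO table_y (key_col, val_col)
SELECT key_col, val_col
FROM   #staging

DROP TABLE #staging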

Ed Harper
A: 

I would agree with the answer given above in a comment: this sounds like an unclosed transaction, particularly if you are still able to run the select statements from Query Analyser.

Sounds very much like there is an open transaction with a pending delete for table_y and the insert can't happen at this point.

When your SP locks up, can you perform an insert into table_y?

Paddy
The only code that ever writes to that table is the SP in question. As the code is BEGIN TRANSACTION, DELETE <table>, INSERT INTO <table>, UPDATE <meta_data table>, COMMIT TRANSACTION, I can't see where the scenario described could arise. Especially since sp_lock doesn't show any lock contention.
Dems
Just out of interest - what happens if this procedure gets run twice - while the first procedure is still executing?
Paddy
A: 

Do you have an index maintenance job?

Are your statistics up to date? One way to tell is examine the estimated and actual query plans for large variations.
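
A quick way to check and refresh them (the source table name is a placeholder):

-- When were the statistics on the source table last updated?
SELECT name, STATS_DATE(object_id, stats_id) AS last_updated
FROM   sys.stats
WHERE  object_id = OBJECT_ID('dbo.table_x')

-- Refresh them manually if they look stale
UPDATE STATISTICS dbo.table_x WITH FULLSCAN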

Mitch Wheat
IBM owns the DBA role for all of our client's SQL Server instances. The indexes are maintained in a nightly process. I'll have to check ACTUAL vs ESTIMATED plans when it's working, and then check the ESTIMATED when it's 'broken'. I can't get the ACTUAL for when it's broken as it seemingly never returns. And, no, we can't leave a locked table for hours waiting for it to return on a live system :)
Dems
A: 

As others have said, this sounds very likely to be an uncommitted transaction.

My best guess:

You'll want to make sure that table_y can be deleted completely and quickly.

If there are other stored procedures or external pieces of code that ever hold transactions on this table, you may be waiting forever. (They may error out and never close the transaction)

Another note: try using TRUNCATE if possible. It uses fewer resources than a DELETE with no WHERE clause:

truncate table table_y

Also, once an error happens within your OWN transaction, it will cause all following calls (every 5 minutes apparently) to "hang", unless you handle your error:

begin try
 begin tran
 -- do normal stuff
 commit
end try
begin catch
 -- roll back the open transaction so subsequent calls are not blocked
 if @@trancount > 0
  rollback
end catch

The very first error is what will give you information about the actual error. Seeing it hang in your own subsequent tests is just a secondary effect.

Jeff Meatball Yang