Hi there,

I have been working on a stored procedure performance problem for over a week now; it is related to my other post on Stack Overflow here. Let me give you some background information.

We have a nightly process which is started by a stored procedure that calls many, many other stored procedures. Lots of the called stored procedures call others in turn, and so on. I have looked at some of the called procs and there are all sorts of frightening, complicated things in there: XML string processing, unnecessary over-use of cursors, over-used NOLOCK hints, rare use of set-based processing, and so on - the list goes on, and it's quite horrendous.

This nightly process in our production environment takes 1 hour 15 minutes to run on average. It sometimes takes 2 hours, which is unacceptable. I have created a test environment on hardware identical to production and run the proc. It took 45 minutes the first time I ran it. If I restore the database to the exact same point and run it again, it takes longer: indeed, if I repeat this action several times (restoring and re-running), the proc takes progressively longer until it plateaus at around 2 hours. This really puzzles me because I restore the database to the exact same point every time. There are no other user databases on the server.

I thought of two lines of investigation to pursue:

  1. Query plans and parameter sniffing
  2. Tempdb

As a test, I restarted SQL Server to clear out both the cache and tempdb and re-ran the proc with the same database restore. The proc took 45 minutes. I repeated this several times to ensure that it was repeatable - again it took 45 minutes each time. I then embarked on several tests to try and isolate the puzzling increase in run times when SQL Server does not get restarted:

  1. Run the initial stored procedure WITH RECOMPILE
  2. Before running the procedure, execute DBCC FREEPROCCACHE to clear out the procedure cache
  3. Before running the procedure, execute CHECKPOINT followed by DBCC DROPCLEANBUFFERS to ensure that the cache was empty and clean
  4. Execute the following script to ensure all stored procedures are marked for recompilation:

    DECLARE @proc_schema SYSNAME
    DECLARE @proc_name SYSNAME
    DECLARE @stmt NVARCHAR(MAX)
    
    -- Cursor over every stored procedure in the database
    DECLARE prcCsr CURSOR LOCAL
        FOR SELECT  specific_schema,
                    specific_name
            FROM    INFORMATION_SCHEMA.routines
            WHERE   routine_type = 'PROCEDURE'
    
    OPEN prcCsr
    
    FETCH NEXT FROM prcCsr INTO @proc_schema, @proc_name
    
    WHILE @@FETCH_STATUS = 0
        BEGIN
            -- Mark the procedure for recompilation on its next execution
            SET @stmt = N'EXEC sp_recompile ''[' + @proc_schema + '].['
                + @proc_name + ']'''
            -- PRINT @stmt   -- DEBUG
            EXEC ( @stmt )
    
            FETCH NEXT FROM prcCsr INTO @proc_schema, @proc_name
        END
    
    CLOSE prcCsr
    DEALLOCATE prcCsr
    

In all the above tests, the procedure takes longer and longer to run with the same database restore. I am really at a loss now as to what to try next. Looking into the code is an option, but realistically it's going to take 3-6 months to get it optimised because there is so much room for improvement. What I am really interested in getting to the bottom of is this: why does the proc execution time get longer on each run against the same restored database, even when the procedure and buffer caches have been cleared?

I did also investigate tempdb and tried to clear out old tables in there, as described in my other Stack Overflow post, but I am unable to manually drop the temp tables that were created from table variables, and they don't seem to want to disappear on their own (even after leaving them for 24 hours).
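
For reference, the leftover objects can be listed with a query along these lines (just a sketch; it shows the user tables sitting in tempdb and when they were created):

SELECT name, create_date
FROM   tempdb.sys.tables
ORDER BY create_date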

Any insight or suggestions for further testing would be greatly appreciated. I am running SQL Server 2005 SP3 64-bit Enterprise edition on a Windows 2003 R2 Ent. edition cluster.

Regards, Mark.

A: 

What does the overall process do, what is the purpose of the operation being performed?

I would assume that executing the process results in data modification within the database. Is this the case?

If this is the case, then each time you run the process the data being considered is different, so different execution plans may be produced, and with them different execution times.

Assuming that modification to the database data is occurring, you should also investigate the following (both are sketched below):

  • Updating relevant database statistics between each process run.
  • Reviewing the level of index fragmentation between each process run, and determining whether defragmentation could prove beneficial.
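
Minimal sketches of those two checks (run them in the database in question; the 30% fragmentation threshold is just a common rule of thumb, not something from the original post):

EXEC sp_updatestats   -- refresh statistics for all tables in the current database

SELECT OBJECT_NAME(ips.object_id) AS table_name,
       ips.index_id,
       ips.avg_fragmentation_in_percent
FROM   sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
WHERE  ips.avg_fragmentation_in_percent > 30
ORDER BY ips.avg_fragmentation_in_percent DESC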
John Sansom
John, thanks for your reply. The database is restored every time it is run in my tests. Please re-read the original text.
Mark Allison
Ah yes, indeed. In that case, the next logical step to me would be to validate that the same execution plans are indeed being used. I suspect they are, but it needs to be confirmed as part of your investigations. Given the volume of procedures executed by the process, you may need to inspect the plan cache in order to validate this. Also, how large is the database (GB) and what type of storage are you using?
John Sansom
John, the complexity of the process is so great that inspecting the query plans caused my SSMS to generate out-of-memory errors. What were you thinking of regarding inspecting the plan cache? Do you have any good strategies? The process generates thousands and thousands of query plans, because there are thousands of queries being run recursively (mainly within cursors).
Mark Allison
+2  A: 

One thing that could cause this is if the process is leaking XML documents. That would cause SQL Server to use more memory, and parts of that might be written to a page file on disk, causing the process to slow down.

Code that creates an XML document looks like:

EXEC sp_xml_preparedocument @idoc OUTPUT, @strXML

It leaks if there is no corresponding:

EXEC sp_xml_removedocument @idoc

XML documents are COM objects stored outside the memory configured for SQL Server. Even if you set SQL Server to use a maximum of 5 GB, leaking XML documents will grow memory usage beyond that.
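
A minimal sketch of the balanced pattern to look for (the XML content and column names here are purely illustrative):

DECLARE @idoc INT
DECLARE @strXML NVARCHAR(MAX)
SET @strXML = N'<orders><order id="1" qty="5"/></orders>'

-- Allocates a parsed document outside the SQL Server buffer pool
EXEC sp_xml_preparedocument @idoc OUTPUT, @strXML

SELECT id, qty
FROM   OPENXML(@idoc, '/orders/order', 1)
       WITH (id INT, qty INT)

-- Without this call the document handle (and its memory) is leaked
EXEC sp_xml_removedocument @idoc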

Andomar
I strongly suspect this might be the cause, because lots of the procs used in the process use sp_xml_preparedocument. I'll have a scan through sys.sql_modules and see if sp_xml_removedocument is specified in every case. Thanks for your reply.
Mark Allison
Andomar, whilst I had strong hopes for this - I checked all the stored procedures and every time sp_xml_preparedocument is called, there is a corresponding sp_xml_removedocument. So, I'm still searching for a solution to the increasing run times, with a static data set.
Mark Allison
Are you monitoring the memory usage? Look for working set, private working set, and commit size.
Andomar
Thanks for the suggestion Andomar - which Performance Monitor counters are they? I could only see Committed Bytes, which is the amount of committed virtual memory.
Mark Allison
You can find them in the process tab of task manager (view -> select columns if they're not there yet.) Or sysinternals process explorer, a free download. I don't know the equivalent performance monitor counters.
Andomar
A: 

Apparently you want to try anything except what you really have to do, which is fix the process. Start by getting rid of the cursors. If it takes two hours right now, I'll bet that without the cursors you can get it down to less than ten minutes.

HLGEM
You haven't seen how many cursors there are. Doing this would take months, believe me.
Mark Allison
Start with one at a time; every one you remove will improve the time. Pick the one which processes the largest number of records first.
HLGEM
I agree. Instead of patching the process, it would be better to fix it the right way.
G Mastros
Guys, this is not an option. I'm very aware that removing cursors will improve things. I did not develop the system, I am simply a production DBA trying to improve performance of a troublesome system. I do not understand the data model, or the processes within the procs to make this change, and if you saw the thousands upon thousands of lines of dreadful SQL code, you would appreciate my situation.
Mark Allison
Fixing the code is a developer's way of looking at things. It doesn't always make business sense. Developer hours are better spent supporting new products/services than perfecting existing processes that kinda sorta work.
Andomar
Also, tackling the cursors does not explain the behaviour I described in my problem. What I really want to find out is why does the process get slower for each run of the proc - after doing a restore to the same point, and freeing the procedure cache? Cursors are not the cause of this behaviour.
Mark Allison
Getting rid of cursors without understanding the cause of the problem first sounds to me like another case of premature optimisation - what if Mark spends ages fixing all of the cursors only to find that the cause of the slowdown was something else?
Kragen
The OP is aware of the "correct" way to clean up the code; however, he is rightly curious as to why the same code should run slower after a restore - which should return the DB to an original, repeatable state. This is stated clearly in the post, and this behaviour is so weird that I for one would like to identify the cause.
James Anderson
A: 

I would log information into a log table, including the time it took to run each step. That will help you narrow down the issue and also help you progressively improve the process by tackling it one piece at a time (starting with the procs that take the longest).

The simplest way is to insert a log row at the beginning and the end of each proc, along the lines of the sketch below.
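
A minimal sketch of that idea, assuming a hypothetical log table called dbo.ProcRunLog:

CREATE TABLE dbo.ProcRunLog (
    log_id    INT IDENTITY(1,1) PRIMARY KEY,
    proc_name SYSNAME,
    step      VARCHAR(10),               -- 'start' or 'end'
    logged_at DATETIME DEFAULT GETDATE()
)

-- First line of each proc:
INSERT dbo.ProcRunLog (proc_name, step) VALUES (OBJECT_NAME(@@PROCID), 'start')

-- ... existing proc body ...

-- Last line of each proc:
INSERT dbo.ProcRunLog (proc_name, step) VALUES (OBJECT_NAME(@@PROCID), 'end')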

J_T
Thanks J_T, this is my next strategy - just got to figure out a way to insert some text at the beginning and end of each proc, and also generate a dependency tree from the initial caller proc.
Mark Allison
You can capture this in a trace, and you shouldn't have to manually add all sorts of logging. See my answer.
BradC
A: 

Cursors are not performance boosters; others have addressed that (and the cursors were not your decision).

Look into how the temp tables are used and managed. Are they global temp tables or session/local temp tables? The fact that they are hanging around looks interesting. Tempdb experiences locking when temp tables are created, which might be part of the issue.

Local temp tables (#mytable syntax) should go away when the session goes out of scope, but you SHOULD drop them explicitly (release early) to free up resources, as sketched below.
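
A minimal illustration of the "release early" pattern (the table name is made up):

CREATE TABLE #mytable (id INT PRIMARY KEY, val INT)

-- ... work with #mytable ...

DROP TABLE #mytable   -- drop explicitly rather than waiting for session cleanup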

Using local temp tables in a transaction that is then cancelled without a COMMIT/ROLLBACK can increase locking in tempdb, causing performance issues. Speaking of transactions - creating temp tables inside transactions causes locks on syscolumns, sysindexes, etc., so other executions are blocked from running the same query.

Temp tables created by a calling procedure and then used inside the called procedures point to a logic problem - rethink that design and try to use relational structures instead.

If you do need temp tables (to eliminate cursors :) then avoid SELECT INTO, so as to avoid locks on system objects.

Global temp tables (##myglobaltable syntax) should be avoided, as access from multiple sessions can be an issue (the table hangs around until all sessions clear) and, for me at least, they add no logical value (look into using a permanent table instead). If they are global: are there procedures blocking each other on them?

Are there a lot of sparse temp tables (tables that grow to hold large data sets but usually contain much smaller ones)?

Microsoft SQL Server Book Online, “Consider using table variables instead of temporary tables. Temporary tables are useful in cases when indexes need to be created explicitly on them, or when the table values need to be visible across multiple stored procedures or functions. In general, table variables contribute to more efficient query processing.”

Of course, if the temp table needs indexes, table variables are not an option; a quick comparison is sketched below.
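
A small illustration of that trade-off (the names are placeholders):

-- Table variable: in SQL Server 2005 only inline PRIMARY KEY/UNIQUE
-- constraints act as indexes
DECLARE @t TABLE (id INT PRIMARY KEY, val INT)

-- Temp table: explicit indexes can be added after creation
CREATE TABLE #t (id INT, val INT)
CREATE INDEX ix_t_val ON #t (val)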

Mark Schultheiss
Note that sometimes table variables barf for no apparent reason, and switching to a temp table fixes the problem. I had this happen recently. A stored procedure would fail after running for 20 minutes when it used to run quickly. I switched from a table variable to a temp table (and no other change) and it went back to the usual minute and a half and success.
Emtucifor
@Emtucifor, yes, the selection has to be made with care, considering the scope; try to keep table variables to low row counts (100 or so, for instance).
Mark Schultheiss
+2  A: 

Reviewing all posts to-date and your related question, it certainly sounds like your strongest lead is the mystery behind those tempdb objects. Some leading questions:

  • After a fresh start, how many objects are in tempdb once the process has run? Is it the same number after every fresh start? (A quick way to count them is sketched after this list.)
  • Do the numbers grow after “successive” runs? Do they grow at the same rate?
  • Can you determine if they occupy space?
  • For that matter, do your tempdb files themselves grow with each successive run of the process?
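
A quick sketch for the first two bullets (it counts user tables in tempdb, which covers both #temp tables and the work tables behind table variables; treat it as a starting point rather than an exact accounting):

SELECT COUNT(*)    AS tempdb_user_tables,
       SUM(p.rows) AS total_rows
FROM   tempdb.sys.objects o
JOIN   tempdb.sys.partitions p
       ON p.object_id = o.object_id AND p.index_id IN (0, 1)
WHERE  o.type = 'U'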

I followed the links, but didn't find any reference discussing the actual problem. You might want to raise the issue on the Microsoft SQL Server TechNet forums here -- they can be pretty good with the abstract stuff. (If all else fails, you can open a case with MS technical support. It might take days, but the odds are very good that they will figure things out. And if it turns out to be an MS bug, they refund your money!)

You've said that rewriting the code is not an option. However, if temp table abuse is a factor, identifying and refactoring those parts of the code first might help a lot. To find out which parts those may be, run SQL Profiler while your process executes. This kind of work is, alas, subjective and highly iterative (meaning you hardly ever get just the right set of events and filters on the first pass). Some thoughts:

  • Start by tracking SP:Starting, to see which stored procedures are being called.
  • SQL Profiler can be used to group data; it’s awkward and I’m not sure how to describe it in mere text, but configured properly you’ll get a Profiler display showing the number of times each procedure was called. Ideally, this would highlight the most frequently called procs, which you can then analyze for temp table abuse and refactor as necessary.
  • If nothing jumps out there, you can trace SP:StmtStarting and do the same thing for individual statements. The problem here is that in a 2+/- hour spaghetti-code run, you might run out of disk space, and analyzing hundreds of MB of trace data can be a nightmare. (Hint: load it into a table, build indexes, then carefully delete out the cruft - a sketch follows this list.) Again, the goal is to identify overly used/abused temp table code to be refactored.
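
A sketch of that load-and-aggregate step (the trace file path and table name are made up):

-- Load a server-side trace file into a table
SELECT *
INTO   dbo.NightlyTrace
FROM   sys.fn_trace_gettable('D:\Traces\nightly_process.trc', DEFAULT)

-- Aggregate by statement text to find the heaviest offenders
-- (Duration is in microseconds in SQL Server 2005 traces)
SELECT CAST(TextData AS NVARCHAR(MAX)) AS stmt,
       COUNT(*)             AS executions,
       SUM(Duration) / 1000 AS total_ms,
       SUM(Reads)           AS total_reads
FROM   dbo.NightlyTrace
GROUP BY CAST(TextData AS NVARCHAR(MAX))
ORDER BY total_ms DESC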
Philip Kelley
Hi Philip. 1124 objects remain in tempdb, all created from table variables. It is the exact same number each time. I performed a server-side trace and posted my results to that as a reply to the question.
Mark Allison
A: 

I don't have the answer, but here are some ideas of what I would do to isolate issues like this.

First, I would take snapshots of sys.dm_os_wait_stats before and after each execution. Subtract the two snapshots (to get the deltas) and see whether any particular wait type is prominent or gets worse with each run. An easy way to calculate the deltas is to copy the sys.dm_os_wait_stats values into Excel worksheets and use VLOOKUP() to subtract corresponding values; a T-SQL version is sketched below. I've used this investigation technique hundreds of times. Don't know what aspect SQL Server is hung up on? Let SQL Server "tell" you via sys.dm_os_wait_stats!
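
For example, a minimal T-SQL version of the snapshot/delta idea (the holding table name is made up):

-- Snapshot before the run
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
INTO   dbo.WaitStatsBefore
FROM   sys.dm_os_wait_stats

-- ... run the nightly process ...

-- Deltas: which wait types accumulated the most time during the run?
SELECT w.wait_type,
       w.wait_time_ms        - b.wait_time_ms        AS delta_wait_ms,
       w.waiting_tasks_count - b.waiting_tasks_count AS delta_waits
FROM   sys.dm_os_wait_stats w
JOIN   dbo.WaitStatsBefore  b ON b.wait_type = w.wait_type
ORDER BY delta_wait_ms DESC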

The other thing I might try is to adjust the behaviour of the loop, to understand whether the subsequent slower executions exhibit degraded throughput for all records from beginning to end, or whether they only slow down for particular sproc(s) in INFORMATION_SCHEMA.routines. Two techniques for exploring this are:

1) Add a "TOP N" clause to the SQL SELECT, such as "TOP 100" or "TOP 1000" (an artificial limit), to see whether you get the subsequent slowdowns for all record-count scenarios, or only when the cursor result set is large enough to include the offending sproc.

2) Instead of adding "TOP N", you can add more PRINT statements (instrumentation) to calculate the throughput as it is processing.

Of course, you can do combination of both.

Maybe these diagnostics will get you closer to the root cause.

Edited to add: By the way, SQL 2008 has a new performance monitor that makes it easy to "eyeball" the numbers in sys.dm_os_wait_stats. For SQL 2005, however, you'll have to calculate the deltas manually via Excel or a script.

JasDev
+1  A: 

Mark-

So it might take 3-6 months to totally re-write this procedure, but that doesn't mean you can't do some relatively quick performance optimization.

Some of the routines I have to support run for 30+ hours; I would be ecstatic to get them to run in 2 hours! The kind of optimization that you do on these routines is a little different from what you'd do on your normal OLTP database:

  1. Capture a trace of the entire process, making sure to capture SP:StmtCompleted and SQL:StmtCompleted events. Make sure to put a filter on Duration (>10ms or something) to eliminate all the quick, unimportant statements.

  2. Pull this trace into a table, and do some filtering/sorting/grouping, focusing on Duration and Reads. You will likely end up with one of two situations:

    (A) A handful of individual queries/statements are responsible for the bulk of the time of the procedure (good news)

    (B) A whole lot of similar statements each take a short amount of time, but together they add up to a long time.

In scenario (A), just focus your attention on these queries. Optimize them using indexes, or using other standard techniques. I highly recommend Dan Tow's book "SQL Tuning" for a powerful technique to optimize queries, especially messy ones with complicated joins.

In scenario (B), step back a bit and look at the set of statements as a whole. Are they all similar in some way? Can you add an index on a key, commonly used table that will improve them all? Can you eliminate a loop that executes 10,000 dynamic queries and instead do a single set-based query, as in the sketch below?
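
To illustrate that last point, a minimal before/after sketch (table and column names are invented):

-- Loop version: thousands of single-row statements, one per cursor iteration
-- UPDATE dbo.Orders SET status = 'Processed' WHERE order_id = @id

-- Set-based equivalent: one statement for the whole batch
UPDATE o
SET    o.status = 'Processed'
FROM   dbo.Orders o
JOIN   dbo.OrdersToProcess p ON p.order_id = o.order_id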

Still two other possibilities, I suppose:

(C) 15,000 totally different dynamic SQL statements, each requiring its own painstaking optimization. In this case, try to focus on server-level optimizations, such as I/O based improvements that will benefit them all.

(D) Something else weird going on with TempDB or something mis-configured on the server. Not much else I can say here, other than find the problem, and fix it!

Hope this helps.

BradC
Great reply Brad - see my answer to my own question.
Mark Allison
+1  A: 

Run the following script at the start of the test and then after each iteration:

select sum(single_pages_kb) as sum_bp_kb
  , sum(multi_pages_kb) as sum_va_kb
  , type
from sys.dm_os_memory_clerks
group by type
having sum(single_pages_kb+multi_pages_kb) > 16
order by sum(single_pages_kb+multi_pages_kb) desc

select sum(total_pages), type_desc
from tempdb.sys.allocation_units
group by type_desc;

select * from sys.dm_os_performance_counters
where counter_name in (
  'Log Truncations'
  ,'Log Growths'
  ,'Log Shrinks'
  ,'Data File(s) Size (KB)'
  ,'Log File(s) Size (KB)'
  ,'Active Temp Tables');

If the results are not self-evident, you can post them somewhere and place a link here; I can look into them and see if something strikes me as odd.

Remus Rusanu
Thanks Remus - I will try this in the future when I have time - I have other work to do right now. Looks like it might isolate things for me, and I learned something from your queries!
Mark Allison
+2  A: 

Can you try the following scenario on the test server:

  1. Make two copies of the database on the server: [A] and [B]. [A] is the database in question, [B] is the copy.
  2. Restart server
  3. Run your process
  4. Drop the database [A]
  5. Rename [B] to [A]
  6. Run your process

This would be like a hot database swap. If the second run takes longer, something is happening at the server level (tempdb, memory, I/O, etc.). If the second run takes about the same time, then the problem is at the database level (locks, index fragmentation, etc.).
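
A sketch of steps 4 and 5 ([A] and [B] standing in for the real database names):

-- Step 4: drop the database that has just been run against
DROP DATABASE [A]

-- Step 5: rename the untouched copy to take its place
ALTER DATABASE [B] MODIFY NAME = [A]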

Good luck!

Dan S
A: 

These are long shots:

  • Quickly look through all of the stored procedures for things that are unusual and that SQL Server should not really be doing, for example sending email or writing files. SQL Server trying to send email to a non-existent email server could cause delays.
  • The other thing to keep in mind is that as you restore the database before each test, your disk may be getting more fragmented (not really sure about this though). That might explain why run times get longer each time until they plateau.
Jeff Widmer
+1  A: 

Here's another factor for you to consider - OUTSIDE the database - and one that could potentially be a LOT quicker and easier to address than sifting through your entire business logic in the database sprocs:

Drive Fragmentation

Your like-for-like tests on a new system, showing gradual performance degradation the more times you run the code, hint at a fragmentation issue...

On a RAID array, the performance degradation from fragmentation is compounded, and if your database can't work out of its buffer cache (memory) then you'll be hitting the disk a lot. One way to check for this:

  • Open perfmon
  • Add a disk queue counter (under Physical Disk) for the drive that the database resides on

With disk queues, anything above zero means that you're waiting for a previous operation to complete and the disk is holding your process up. So if there is a queue for more than 25-30% of the time, you need to figure out why: either the disks are too slow, or the fragmentation has snowballed.

NTFS is notoriously bad for managing fragmentation, especially when a disk becomes more than 70% full.

If you can, move your database logs to a different drive: they grow and shrink (much like adding and deleting files), so separating them should help your fragmentation problem somewhat.

In addition to external fragmentation at the file-system level (NTFS), you can also get index fragmentation inside your database from lots of insert operations. You should consider rebuilding your indexes at least weekly, and look into the FILL_FACTOR and PAD_INDEX options - you will want both. Also consider using the SORT_IN_TEMPDB option and moving your tempdb to another drive. A sketch of such a rebuild is below.
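
A minimal sketch of such a weekly rebuild (the table name and the 80 percent fill factor are placeholders to tune for your own data):

ALTER INDEX ALL ON dbo.YourLargeTable
REBUILD WITH (FILLFACTOR = 80, PAD_INDEX = ON, SORT_IN_TEMPDB = ON)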

Neil Fenwick
Mark Allison
Hi Mark, if your procedures perform a lot of insert operations, still have a look into the PAD_INDEX and FILL_FACTOR settings of the indexes. That leaves space for new data; otherwise, for an INSERT operation the database first has to move data around to make space. It could be the index fill factor that is killing your performance...
Neil Fenwick
A: 

Firstly, thanks to everyone for some really great help. I much appreciate your time and expertise in helping me to solve this very strange issue. I have an update.

I started a server-side trace to try and isolate the stored procs that were running slower between iterations. What I found surprised me. 96 stored procedures are involved in the process. Most of these stored procedures ran slower the second time around - about 50 of them. The rest were very quick to run and didn't influence the overall time at all, and in fact some of these ran a little quicker (as would be expected).

I failed over the database instance to another node in my cluster and ran the tests there with the exact same results - so I can rule out any OS differences between cluster nodes - when building the clusters I was very conscious to build them identically.

1100 temp tables get created during the process and persist after it has finished - these all come from table variables, and I found a way to remove them: running sp_recompile on every proc and function in the database caused all the temp tables to get cleared up. However, this did not improve the run times at all. The only thing that helps the run times is a restart of the SQL Server service. Unfortunately I am out of time to investigate this further - I have other work to do - but I would like to persist with it. Perhaps I will come back to it later if I get a spare few hours. In the meantime, however, I have to admit defeat with no solution and no bounty to give.

Thanks again everyone.

Mark Allison