I've been analyzing a recurring "bug report" (perf issue) in one of our systems related to a particularly slow delete operation. Long story short: It seems that the CASCADE DELETE
keys were largely responsible, and I'd like to know (a) if this makes sense, and (b) why it's the case.
We have a schema of, let's say, widgets, those being at the root of a large graph of related tables and related-to-related tables and so on. To be perfectly clear, deleting from this table is actively discouraged; it is the "nuclear option" and users are under no illusions to the contrary. Nevertheless, it sometimes just has to be done.
The schema looks something like this:
Widgets
|
+--- Anvils [1:1]
| |
| +--- AnvilTestData [1:N]
|
+--- WidgetHistory (1:N)
|
+--- WidgetHistoryDetails (1:N)
Column definitions look like the following:
Widgets (WidgetID int PK, WidgetName varchar(50))
Anvils (AnvilID int PK, WidgetID int FK/IX/UNIQUE, ...)
AnvilTestData (AnvilID int FK/IX, TestID int, ...Test Data...)
WidgetHistory (HistoryID int PK, WidgetID int FK/IX, HistoryDate datetime, ...)
WidgetHistoryDetails (HistoryID int FK/IX, DetailType smallint, ...)
Nothing too scary, really. A Widget
can be different types, an Anvil
is a special type, so that relationship is 1:1 (or more accurately 1:0..1). Then there's a large amount of data - perhaps thousands of rows of AnvilTestData
per Anvil
collected over time, dealing with hardness, corrosion, exact weight, hammer compatibility, usability issues, and impact tests with cartoon heads.
Then every Widget
has a long, boring history of various types of transactions - production, inventory moves, sales, defect investigations, RMAs, repairs, customer complaints, etc. There might be 10-20k details for a single widget, or none at all, depending on its age.
So, unsurprisingly, there's a CASCADE DELETE
relationship at every level here. If a Widget
needs to be deleted, it means something's gone terribly wrong and we need to erase any records of that widget ever existing, including its history, test data, etc. Again, nuclear option.
Relations are all indexed, statistics are up to date. Normal queries are fast. The system tends to hum along pretty smoothly for everything except deletes.
Getting to the point here, finally, for various reasons we only allow deleting one widget at a time, so a delete statement would look like this:
DELETE FROM Widgets
WHERE WidgetID = @WidgetID
Pretty simple, innocuous looking delete... that takes over 2 minutes to run, for a widget with no data!
After slogging through execution plans I was finally able to pick out the AnvilTestData
and WidgetHistoryDetails
deletes as the sub-operations with the highest cost. So I experimented with turning off the CASCADE
(but keeping the actual FK, just setting it to NO ACTION
) and rewriting the script as something very much like the following:
DECLARE @AnvilID int
SELECT @AnvilID = AnvilID FROM Anvils WHERE WidgetID = @WidgetID
DELETE FROM AnvilTestData
WHERE AnvilID = @AnvilID
DELETE FROM WidgetHistory
WHERE HistoryID IN (
SELECT HistoryID
FROM WidgetHistory
WHERE WidgetID = @WidgetID)
DELETE FROM Widgets WHERE WidgetID = @WidgetID
Both of these "optimizations" resulted in significant speedups, each one shaving nearly a full minute off the execution time, so that the original 2-minute deletion now takes about 5-10 seconds - at least for new widgets, without much history or test data.
Just to be absolutely clear, there is still a CASCADE
from WidgetHistory
to WidgetHistoryDetails
, where the fanout is highest, I only removed the one originating from Widgets
.
Further "flattening" of the cascade relationships resulted in progressively less dramatic but still noticeable speedups, to the point where deleting a new widget was almost instantaneous once all of the cascade deletes to larger tables were removed and replaced with explicit deletes.
I'm using DBCC DROPCLEANBUFFERS
and DBCC FREEPROCCACHE
before each test. I've disabled all triggers that might be causing further slowdowns (although those would show up in the execution plan anyway). And I'm testing against older widgets, too, and noticing a significant speedup there as well; deletes that used to take 5 minutes now take 20-40 seconds.
Now I'm an ardent supporter of the "SELECT ain't broken" philosophy, but there just doesn't seem to be any logical explanation for this behaviour other than crushing, mind-boggling inefficiency of the CASCADE DELETE
relationships.
So, my questions are:
Is this a known issue with DRI in SQL Server? (I couldn't seem to find any references to this sort of thing on Google or here in SO; I suspect the answer is no.)
If not, is there another explanation for the behaviour I'm seeing?
If it is a known issue, why is it an issue, and are there better workarounds I could be using?