views:

44

answers:

2

I have the following tables:

Entry
EntryID - int
EntryDate - datetime

Hour
EntryID - int
InHour - datetime
OutHour - datetime

For each registry in the Entry table, there should be at least one (could be many) registries on the Hour table, like so:

Entry
EntryID: 8
EntryDate: 9/9/2010 12:31:25

Hour
EntryID: 8
InHour: 9/9/2010 12:31:25
OutHour: 9/9/2010 18:21:19

Now, this information is stored on 2 equal databases, one on local machine and one on a server. I'm trying to write a query that will delete all the information that has already been passed to the server under the condition that the registries that do not have an OutHour (null) will not be deleted.

I wrote the following query:

DELETE from [dbo].[Entry]
 WHERE [dbo].[Entry].[EntryID] IN (SELECT [EntryID] 
                                     FROM [LINKEDSERVER].[MYDATABASE].[dbo].[Entry]) 
  AND [dbo].[Entry].[EntryID] IN (SELECT [EntryID] 
                                    FROM [dbo].[Hour] 
                                   WHERE [OutHour] IS NOT NULL)

DELETE from [dbo].[Hour]
 WHERE [dbo].[Hour].[InHour] IN (SELECT [InHour] 
                                   FROM [LINKEDSERVER].[MYDATABASE].[dbo].[Hour]) 
   AND [dbo].[Hour].[OutHour] IS NOT NULL 

AFAIK, this query first checks in the Entry table and will delete any registries that are already on the server and do not have a corresponding Hour registry that has a null OutHour. However today I found out that an Entry record was deleted but the corresponding Hour wasn't (it had a null OutHour).

What am I doing wrong? Any help is appreciated.

Thanks!

+1  A: 

What's going wrong is that your second query only uses InHour, without referring to the EntryID. Also, your first query has its conditions completely independent from each other, which may not be a problem if your Hour table constraints are correct (the first column can never be null when the second is not null), but it's worth looking at.

In relational databases, it's best to get in the habit of thinking in terms of JOINs rather than IN(). Using IN() can often return the same results as a JOIN (with some differences in NULL handling) and often even gets the same execution plan, but it is #1 a "relaxed" way of thinking about the problem which doesn't lend itself well to the mental space needed for writing complex queries and #2 can't compare multiple values at once, it can only do a single comparison (at least in SQL Server, since some other DBMSes can do this).

Let me rewrite your queries as JOINs and maybe it will help you see what's wrong.

DELETE E
FROM
   dbo.Entry E
   INNER JOIN LINKEDSERVER.MYDATABASE.dbo.Entry L ON E.EntryID = L.EntryID
   INNER JOIN Hour H ON E.EntryID = H.EntryID
WHERE
   H.OutHour IS NOT NULL

DELETE H
FROM
   dbo.Hour H
   INNER JOIN LINKEDSERVER.MYDATABASE.dbo.Hour L ON H.InHour L.InHour
WHERE
   H.OutHour IS NOT NULL

I recommend you put a cascade delete foreign key constraint on the hour table so that when you delete from the Entry table, the child Hour rows all disappear. There are still problems here as you could have many Hour rows per EntryID and semantically you can end up trying to delete the same row over the linked server multiple times.

Also, be aware that huge joins over linked servers can experience very poor performance because sometimes the query engine decides to pull huge rowsets over the link, even entire tables. You can mitigate this by doing things in batches, perhaps by first doing a select into a temp table based on a JOIN across the link, then deleting corresponding rows in small batches of 100 or 1000 or 5000 (testing is in order to find the right size).

Last, if you do find that your queries are unnecessarily pulling huge sets of data over the link (determine this by running Query Profiler on the remote matchine to see what actual queries are being submitted), then strategic use of CROSS APPLY can help by forcing row-by-row processing, which in the case of linked servers can be an enormous performance improvement, despite how counter-intuitive that is compared to the standard and strong recommendation to never do row-by-row in relational databases. Think of it as forcing a "stretch bookmark lookup" rather than a "stretch table scan" and you'll get an inkling of why this can be such a big help.

Emtucifor
A: 

My very first suggestion is to put a foreign key relationship between the two on EntryID. This will prevent any deletions from the Entry table without first removing all instances from the Hour table.

Secondly, with a foreign key in place you have to do it from the child to the parent (aka, start at the bottom of the hierarchy). This means i would do this first:

delete from dbo.Hour where OutHour is not null
delete e
from dbo.Entry e
left outer join dbo.Hour h
on e.entryid=h.entryid
where h.entryid is null
DForck42