views:

31

answers:

1

I've got a database table of type-2 data, and I want to find records that were deleted since I last synced with it. It's got date_from and date_to columns, and the raw data has an ID column object_id. date_to<>null means it doesn't exist now, so if there's no other record with the same object_id and date_to=null, then it's been deleted.

I believe a naive implementation would be something like:

select * from data_t2 a
where a.date_to > last_sync_date and a.date_to < current_date()
and not exists (select * from data_t2 b
                where b.date_to is null and b.object_id = a.object_id);

but obviously that's going to be ridiculously expensive.

Is there an obvious more efficient way that I'm missing? I suspect there isn't (or rather, that I should assume there are relatively few deleted records, and do some of the computation outside the RDBMS), but I figured I'd ask just in case.

Thanks!

+1  A: 

Before you start tuning the query, you really should run EXPLAIN or apply some other diagnostics. Otherwise you cannot see how a rewrite changes the plan

You can rewrite this with an outer join. In for example MySQL, this will be much faster than the subquery:

SELECT    * 
FROM      data_t2 a
LEFT JOIN data_t2 b
ON        a.object_id = b.object_id
AND       b.date_to IS NULL
WHERE     a.date_to > last_sync_date 
AND       a.date_to < current_date()    
AND       b.object_id IS NULL

If the dimension table is really large, and there is an index that has date_to as first column, and the number of rows having date_to IS NULL is a small fraction of the entire table, this might be faster still:

SELECT    * 
FROM      data_t2 a
LEFT JOIN (
          SELECT object_id
          FROM   data_t2 b
          WHERE  b.date_to IS NULL
          )
ON        a.object_id = b.object_id
WHERE     a.date_to > last_sync_date 
AND       a.date_to < current_date()    
AND       b.object_id IS NULL
Roland Bouman
Yeah, I knew it's possible to turn a subquery with a join, and I'm a total EXPLAIN junkie. This looks fairly straightforward. Adding an index on date_to sounds entirely reasonable, if it helps much here. This is a good starting point.
Ken