views:

716

answers:

4

Here's the scenario: there is a Software table (PK = SoftwareID) and an associated Release table (PK = [SoftwareID,Version]).

A release can be major or minor, release type is identified by Release.ReleaseType ('MAJ', 'MIN').

A release is also characterized by a date: Release.ReleaseDate.

Software is partitioned into categories, identified by Software.CategoryID.

Problem: need an effective T-SQL query to list all software pieces of a certain category and having the first major release date falling inside a given interval, delimited by @DateFrom, @DateTo. The only columns needed in the final resultset are SoftwareID and ReleaseDate.

This is not the real case scenario but I formulated it this way to be easier to understand. In the real case the table Release would have around 10 million records and the table Software around 1 million. I came up already with a solution but it's quite slow and I feel the experts around here might find something better.

Here's my slow solution:

select  s.SoftwareID, min(r.ReleaseDate)
from
    Software s inner join Release r on (s.SoftwareID = r.SoftwareID)
where s.CategoryID = @Category
      and r.ReleaseType = 'MAJ'
group by
    s.SoftwareID
having
    min(r.ReleaseDate) >= @DateFrom
    and min(r.ReleaseDate) < @DateTo

Thanks.

+2  A: 

Your query is good.

You may want to make sure you have proper indexes for your query:

  • in "MSSQL Management Studio": "Query"->"Analyze Query in Database Engine Tuning Advisor", then follow the "white rabbit" (preferred)
  • in "MSSQL Management Studio": "Query"->"Include Actual Execution Plan". Then run the query, and look if the execution plan suggests new indexes.
van
@Sorin: take a look at solution of Tomalak: he has a very good point with the order of filter/group-by.
van
+1  A: 

You could have an index problem. Have you tried making an index on the ReleaseDate column or creating a clustered index on the table containing ReleaseDate (sorted by ReleaseDate)?

Thies
+1 for suggesting the most likely cullprit but a clustered index on ReleaseDate doesn't make sense to me.
Lieven
+1  A: 

Try something along the lines of:

select  
  s.SoftwareID, 
  min(r.ReleaseDate)
from
  Software s 
  inner join Release r on s.SoftwareID = r.SoftwareID
where 
  s.CategoryID = @Category
  and r.ReleaseType = 'MAJ'
  and s.ReleaseDate >= @DateFrom
  and s.ReleaseDate < @DateTo
group by
  s.SoftwareID

The basic idea is: Why do you filter on a date after grouping, when you can prevent unwanted records entering the result before grouping takes place.

You are grouping on s.SoftwareID. To me it seems that there is no way that HAVING MIN(s.ReleaseDate) >= ... would affect different records than WHERE s.ReleaseDate >= ....

Tomalak
Filter by ReleaseDate in the where clause would bring false positives into the resultset. Most obvious example would be: interval set to a month, software has first major release in the previous month.
Sorin Comanescu
Then what about "WHERE r.ReleaseDate >= ..."? (I'm not sure why you have a ReleaseDate in the Software table at all.)
Tomalak
There was a mistake in the query, just fixed it.
Sorin Comanescu
+1 good point .
van
Tomalak, i'm not looking for the first major release of a given interval but for softwares having the earliest major release falling into that interval. Billy detailed this in his comments below.I think the having clause has to stay, unless i'm totally missing something major...
Sorin Comanescu
+1  A: 

Thank you all for your suggestions. Problem seems to be almost solved, i think there's not much to be done further.

The query advisor suggested 2 useful indexes, one of them being something like:

CREATE NONCLUSTERED INDEX [IX_Release_1234] ON [dbo].[Release] 
(
    [ReleaseType] ASC,
    [SoftwareID] ASC
)
INCLUDE ( [ReleaseDate]) WITH (SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]

Index tuning cut the execution time to less than 50%. However, another slight improvement (25-30% of previously halved time) was to reorganize the query by first getting the release dates exclusively from Release table and then joining this subquery with Software:

select s.SoftwareID, r.ReleaseDate
from Software s inner join (
    select   SoftwareID, min(ReleaseDate) as ReleaseDate
    from     Release
    where    ReleaseType = 'MAJ'
    group by SoftwareID
    having
        min(ReleaseDate) >= @DateFrom
        and min(ReleaseDate) < @DateTo
) r on (
    s.SoftwareID = r.SoftwareID
)
where
    s.CategoryID = @Category

Now I guess index tuning should be redone :)...

Bottom line, use the engine tuning advisor whenever possible and thank you all again.

Sorin Comanescu
Suppose you have dates D1 and D2, and :software S1 - major releases : R1, R2 in [D1, D2), R3 in [D1, D2), R4, etcS2 - major releases : R1 in [D1, D2 ), R2 in [D1, D2), R3, R4, etcWhat do you want to have as results ?- S1-R2 and S2-R1 ( earlier major release in interval )- or only S2-R1 ( first major release in interval )
Billy
Just checking formatting capabilities, coming back soon...
Sorin Comanescu
Guess I'll have to edit the above response for better formatting :(
Sorin Comanescu
That's right, if S1-R1 falls before D1, then just S2-R1.
Sorin Comanescu
looks like html formatting doesn't work in the comments, see mine above that's pretty messed up :-)
Billy