views:

318

answers:

4

Problem


Given the following two tables, I'd like to select all Ids for Posts that have their most recent (i.e. last) comment made in the given time span (e.g. Feb 2010).

The result of the query should only return Post ID 1, since the most recent comment for Post ID 2 is outside the range of the time span filter.

Question


I've created the SELECT statement below that seems correct and handles all the test cases thrown at it.

However, in a quest to continue to improve my SQL skills, I'm asking the community if there is a "better" method to use for this scenario, any suggestions on improving the existing statement, and/or edge cases that are not covered.

Note that this is an loose translation of the actual tables, changed with the intent of making the question easier to understand. For what it's worth, I'm using SQL Server 2005.

Tables


Post

Id    Text     Visible
1     Post 1   1
2     Post 2   1
3     Post 3   0
.     ...
n     Post n   1

Comment

Id    Post_Id    Text                  CommentNumber    Timestamp
1     1          Comment 1, Post 1     1                2/3/2010
2     1          Comment 2, Post 1     2                2/4/2010
3     2          Comment 1, Post 2     1                3/1/2010
.     .          .
n     m          Comment n, Post m     x                xx/xx/xxxx


SQL Command


SELECT [Id],[Text]
FROM [Post]  
WHERE [Id] IN (  
    SELECT comment1.[Post_Id]  
    FROM (  
        SELECT max([CommentNumber]) as maxComment,  
            [Post_id]  
        FROM [Comment]  
        GROUP BY [Post_id]  
    ) as comment2  
    INNER JOIN [Comment] as comment1 on comment1.[Post_id] = comment2.[Post_id]  
    WHERE comment1.[Timestamp] BETWEEN '2/1/2010 00:00:00.000' AND '2/28/2010 23:59:59.999'  
    AND comment1.[CommentNumber] = comment2.maxComment  
)
AND [Post].[Visible] = 1


Bonus Question


Is it possible to create this query with NHiberate (either using the Criteria API or HQL)?

+4  A: 
SELECT
    Post_Id
FROM
    Comment
GROUP BY
    Post_Id
HAVING
    MAX(Timestamp) >= '2/1/2010'

Thinkg of HAVING as a WHERE that takes place after GROUP BY, operating on the grouped resultset.

Don't know about NHibernate though.

AakashM
Excellent, this is much cleaner.
Kevin Pullin
A: 

This should be a little quicker than using the HAVING clause.

select distinct Post_id from Comment
where Timestamp >= '2/1/2010';
ar
This is close, but it doesn't capture the requirement to include only the Posts that have their most recent comment made within the time range.
Kevin Pullin
+3  A: 

Good solutions have already been posted, but I thought I'd post an explanation on how your query can be simplified step-by-step:

The outermost subquery is redundant

The outermost part of the subquery (the SELECT [Id] FROM [Post] WHERE [Id] IN ( bit) is redundant, as you are already returning a list of Ids).

This leaves us with

SELECT comment1.[Post_Id]
FROM (  
    SELECT max([CommentNumber]) as maxComment,  
        [Post_id]  
    FROM [Comment]  
    GROUP BY [Post_id]  
) as comment2  
INNER JOIN [Comment] as comment1 on comment1.[Post_id] = comment2.[Post_id]  
WHERE comment1.[Timestamp] BETWEEN '2/1/2010 00:00:00.000' AND '2/28/2010 23:59:59.999'  
AND comment1.[CommentNumber] = comment2.maxComment  

The use of CommentNumber is redundant

There isn't any need to use CommentNumber to get the most recent comment as the posts are already ordered by Timestamp. This means that rather than selecting the TimeStamp of the comment with the highest Id we can just select the highest TimeStamp.

This eliminates the need to join to Comments again, leaving us with:

SELECT [Post_Id], SomeColumn, SomeOtherColumn
FROM (
    SELECT max([TimeStamp]) as maxTimeStamp,
        [Post_id],
        SomeColumn,
        SomeOtherColumn
    FROM [Comment]
    GROUP BY [Post_id]
) as GroupedComments
WHERE GroupedComments.maxTimeStamp BETWEEN '2/1/2010 00:00:00.000' AND '2/28/2010 23:59:59.999'

The subquery is now redundant

Now the query has been simplified somewhat it should be easy to see how it can be reduced further to one of the other solutions posted here using the distinct or having syntax.

Use < and >= rather than BETWEEN

Just a small niggle. Rather than goto great lengths to find the last date in february, splitting up the BETWEEN into a < and a >= makes the query much cleaner:

WHERE GroupedComments.maxTimeStamp >= '2/1/2010'
AND GroupedComments.maxTimeStamp < '3/01/2010'
Kragen
Thanks for the terrific explanations! A side note - the outermost query isn't actually redundant, as I am selecting more than just the Id from the Post table (this is my fault and I've since updated the post to help clarify the full intent of the query).
Kevin Pullin
@Kevin You can still have the query return more than just the ID from the post table even without the subqeury.
Kragen
@Kragen - That would require a join correct? And to include all extra columns in the GROUP BY clause? I've got a sample of this working and the query plans are ever so slightly different between the two; I wonder if one performs better than the other. The actual data set size is so small it'll never matter for this case, so I'll use the one that I find more readable.
Kevin Pullin
@Kevin Nope, you just need to add the columns to the subquery - I've edited the last snippet so that it gets a couple of extra columns as an example. You should find that the query plan is very similar as SQL server optimises the query anyway. For example the execution plan for `SELECT * FROM T` should be the same as the plan for `SELECT * FROM (SELECT * FROM T)`
Kragen
@Kragen - The intent of this query is to use the IDs from the Comments table to select data from the Post table, which is what I'm referring to as needed the join (e.g. to get [Post].[Text]). In your updated example, the additional columns must be included in the group by clause as well?
Kevin Pullin
@Kevin Ah ok yes, now I understand! Yes a join is needed in that case :-) (sorry! :-p)
Kragen
@Kragen - No worries - thanks again for your insights.
Kevin Pullin
A: 

This is the query I'm currently targeting after combining AakashM's and Kragen's responses:

SELECT [Id],[Text]
From [Post]
WHERE [Id] IN (
    SELECT Post_Id
    FROM Comment
    GROUP BY Post_Id
    HAVING MAX(Timestamp) >= '3/1/2010' AND MAX(Timestamp) < '4/1/2010'
)
AND [Post].[Visible] = 1

Here is how to represent this query in NHibernate using the Criteria API:

var subCriteria = DetachedCriteria.For<Comment>()
    .SetProjection(Projections.ProjectionList()
        .Add(Projections.GroupProperty("Post.Id")))
    .Add(Restrictions.Ge("Timestamp", new DateTime(2010, 3, 1)))
    .Add(Restrictions.Lt("Timestamp", new DateTime(2010, 4, 1)));

var criteria = session.CreateCriteria<Post>()
    .Add(Restrictions.Eq("Visible", true))
    .Add(Subqueries.PropertyIn("Id", subCriteria));
Kevin Pullin