I've got two tables in a SQL Server 2000 database joined by a parent-child relationship. In the child table, the unique key is made up of the parent id and the datestamp.

I need to join these tables so that each parent is joined only to its most recent child entry.

Can anyone give me any hints how I can go about this?

A: 

Here's the most efficient way I've found to do this. I tested it against several structures, and it had the lowest I/O of the approaches I tried.

This sample gets the last revision of each article:

SELECT t.*
FROM ARTICLES AS t
    -- Join every history entry for the article
    INNER JOIN REVISION AS lastHis ON t.ID = lastHis.FK_ID
    -- Look for a later history entry; the WHERE clause keeps only rows where none exists
    LEFT JOIN REVISION AS his2 ON lastHis.FK_ID = his2.FK_ID AND lastHis.CREATED_TIME < his2.CREATED_TIME
WHERE his2.ID IS NULL
Laramie
Thank you, that totally sorted it for me.
BenAlabaster
I'm glad I could pass it along.
Laramie
Mighty inefficient O(N^2). Check out the ROW_NUMBER solution for O(N) complexity.
wqw
Can't use Row_Number() in a SQL Server 2000 database, so that doesn't help.
BenAlabaster
@wqw You might want to check out the tags before deducting points.
Laramie
Oops, my bad! Of course you can't. This is still mighty inefficient, and it quickly turns into a mess if you "order by" more than one column `(lastHis.CREATED_TIME < his2.CREATED_TIME OR lastHis.CREATED_TIME = his2.CREATED_TIME AND lastHis.SecondColumn < his2.SecondColumn)` or join more than one base table. I would personally use a `SELECT TOP 1` sub-query every time; I just did that on a million-row table (actually a couple of JOINed tables). Can't believe no one has mentioned it as an answer already...
wqw
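For reference, the `SELECT TOP 1` sub-query mentioned in the comment above could look roughly like the sketch below. It is only an illustration, not the commenter's actual query: the table variables, the `TITLE` column, and the sample rows are made up so that the batch runs on its own, and the syntax is kept to what SQL Server 2000 accepts.

```sql
-- Hypothetical sample data standing in for ARTICLES and REVISION.
DECLARE @articles TABLE (ID INT PRIMARY KEY, TITLE VARCHAR(50));
DECLARE @revision TABLE (ID INT PRIMARY KEY, FK_ID INT, CREATED_TIME DATETIME);

INSERT INTO @articles (ID, TITLE)
SELECT 1, 'First article' UNION ALL
SELECT 2, 'Second article';

INSERT INTO @revision (ID, FK_ID, CREATED_TIME)
SELECT 10, 1, '20100112' UNION ALL
SELECT 11, 1, '20100115' UNION ALL
SELECT 12, 2, '20100512';

-- For each article, keep only the revision whose ID is returned by a
-- correlated TOP 1 sub-query ordered by newest CREATED_TIME first.
SELECT t.ID, t.TITLE, r.ID AS REVISION_ID, r.CREATED_TIME
FROM @articles AS t
    INNER JOIN @revision AS r ON r.FK_ID = t.ID
WHERE r.ID = (SELECT TOP 1 r2.ID
              FROM @revision AS r2
              WHERE r2.FK_ID = t.ID
              ORDER BY r2.CREATED_TIME DESC);
```

With this sample data, article 1 is paired with revision 11 and article 2 with revision 12. Ordering by more than one column only means adding columns to the sub-query's `ORDER BY`. One difference from the self-join: if two revisions share the latest `CREATED_TIME`, `TOP 1` picks just one of them, while the self-join returns both.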
A: 

If you had a table which just contained the most recent entry for each parent, and the parent's id, then it would be easy, right?

You can make a table like that by joining the child table on itself, taking only the maximum datestamp for each parent id. Something like this (your SQL dialect may vary):

   SELECT t1.*
     FROM child AS t1
LEFT JOIN child AS t2
       ON (t1.parent_id = t2.parent_id and t1.datestamp < t2.datestamp)
    WHERE t2.datestamp IS NULL

That gets you every row in the child table for which no later datestamp exists for the same parent id. You can use that result as a subquery to join to:

   SELECT *
     FROM parent
     JOIN ( SELECT t1.*
              FROM child AS t1
         LEFT JOIN child AS t2
                ON (t1.parent_id = t2.parent_id and t1.datestamp < t2.datestamp)
             WHERE t2.datestamp IS NULL ) AS most_recent_children
       ON (parent.id = most_recent_children.parent_id)

or join the parent table directly into it:

   SELECT parent.*, t1.*
     FROM parent
     JOIN child AS t1
       ON (parent.id = t1.parent_id)
LEFT JOIN child AS t2
       ON (t1.parent_id = t2.parent_id and t1.datestamp < t2.datestamp)
    WHERE t2.datestamp IS NULL
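
To see what the direct-join form returns, here is a small self-contained sketch. The table variables, the `name` and `note` columns, and the sample rows are invented for illustration; only `id`, `parent_id`, and `datestamp` come from the queries above. It sticks to syntax that SQL Server 2000 accepts.

```sql
-- Hypothetical sample data; only id, parent_id and datestamp come from the question.
DECLARE @parent TABLE (id INT PRIMARY KEY, name VARCHAR(20));
DECLARE @child  TABLE (parent_id INT, datestamp DATETIME, note VARCHAR(20),
                       PRIMARY KEY (parent_id, datestamp));

INSERT INTO @parent (id, name)
SELECT 1, 'alpha' UNION ALL
SELECT 2, 'beta'  UNION ALL
SELECT 3, 'no children';

INSERT INTO @child (parent_id, datestamp, note)
SELECT 1, '20100112', 'older'     UNION ALL
SELECT 1, '20100115', 'latest, 1' UNION ALL
SELECT 2, '20100512', 'latest, 2';

-- t1 is the candidate child row; t2 looks for a later row for the same parent.
-- Rows where no later row exists (t2.datestamp IS NULL) are the most recent ones.
SELECT p.id, p.name, t1.datestamp, t1.note
FROM @parent AS p
JOIN @child AS t1
  ON p.id = t1.parent_id
LEFT JOIN @child AS t2
  ON t1.parent_id = t2.parent_id AND t1.datestamp < t2.datestamp
WHERE t2.datestamp IS NULL;
```

This returns one row for parent 1 (the 15 January entry) and one for parent 2. Parent 3 is dropped because the first join is an inner join; switching it to a `LEFT JOIN` would keep childless parents with NULL child columns.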
Ian Clelland
A: 

Use this query as a basis. Note that the CTE definitions only supply sample data and are not part of the solution, so the solution itself is simple:

use test;
with parent as (
select 123 pid union all select 567 union all
select 125 union all 
select 789),
child as(
select 123 pid,CAST('1/12/2010' as DATE) stdt union all
select 123 ,CAST('1/15/2010' AS DATE) union all
select 567 ,CAST('5/12/2010' AS DATE) union all
select 567 ,CAST('6/15/2010' AS DATE) union all
select 125 ,CAST('4/15/2010' AS DATE) 
)
select pid,stdt from(
select a.pid,b.stdt,ROW_NUMBER() over(partition by a.pid order by b.stdt desc) selector
from parent as a
left outer join child as b
on a.pid=b.pid) as x
where x.selector=1
josephj1989
Again, can't use this in SQL Server 2000.
BenAlabaster