views:

108

answers:

5

I am facing a conceptual problem that I am having a hard time overcoming. I am hoping the SO folks can help me overcome it with a nudge in the right direction.

I am in the process of doing some ETL work with the source data being very similar and very large. I am loading it into a table that is intended for replication and I only want the most basic of information in this target table.

My source table looks something like this:

alt text

I need my target table to reflect it as such:

alt text

As you can see I didn't duplicate the InTransit status where it was duplicated in the source table. The steps I am trying to figure out how to achieve are

  1. Get any new distinct rows entered since the last time the query ran. (Easy)
  2. For each TrackingId I need to check if each new status is already the most recent status in the target and if so disregard otherwise go ahead and insert it. Which this means I have to also start at the earliest of the new statuses and go from there. (I have no *(!#in clue how I'll do this)
  3. Do this every 15 minutes so that statuses are kept very recent so step #2 must be performant.

My source table could easily consist of 100k+ rows but having the need to run this every 15 minutes requires me to make sure this is very performant thus why I am really trying to avoid cursors.

Right now the only way I can see to do this is using a CLR sproc but I think there may be better ways thus I am hoping you guys can nudge me in the right direction.

I am sure I am probably leaving something out that you may need so please let me know what info you may need and I'll happily provide.

Thank you in advance!

EDIT: Ok I wasn't explicit enough in my question. My source table is going to contain multiple tracking Ids. It may be up to 100k+ rows containing mulitple TrackingId's and multiple statuses for each trackingId. I have to update the target table as above for each individual tracking Id but my source will be an amalgam of trackingId's.

+1  A: 

How well this performs will depend on indexes, and particularly if you are targeting a single TrackingID at a time, but this is one way to use a CTE and self-join to obtain the desired results:

CREATE TABLE #foo
(
    TrackingID INT,
    [Status] VARCHAR(32),
    StatusDate SMALLDATETIME
);

INSERT #foo SELECT 1, 'PickedUp',  '2010-10-01 08:15';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 03:07';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 10:28';
INSERT #foo SELECT 1, 'Delayed',   '2010-10-03 09:52';
INSERT #foo SELECT 1, 'InTransit', '2010-10-03 20:09';
INSERT #foo SELECT 1, 'AtDest',    '2010-10-04 13:42';
INSERT #foo SELECT 1, 'Deliv',     '2010-10-04 17:05';

WITH src AS
(
    SELECT 
        TrackingID,
        [Status],
        StatusDate, 
        ab = ROW_NUMBER() OVER (ORDER BY [StatusDate])
    FROM #foo
    WHERE TrackingID = 1
),
realsrc AS
(
    SELECT 
        a.TrackingID,
        leftrow         = a.ab,
        rightrow        = b.ab,
        leftstatus      = a.[Status],
        leftstatusdate  = a.StatusDate,
        rightstatus     = b.[Status],
        rightstatusdate = b.StatusDate 
    FROM src AS a
    LEFT OUTER JOIN src AS b
    ON a.ab = b.ab - 1
)
SELECT 
    Id = ROW_NUMBER() OVER (ORDER BY [leftstatusdate]),
    TrackingID,
    [Status] = leftstatus,
    [StatusDate] = leftstatusdate
FROM
    realsrc
WHERE
    rightrow IS NULL
    OR (leftrow = rightrow - 1 AND leftstatus <> rightstatus)
ORDER BY 
    [StatusDate];
GO
DROP TABLE #foo;

If you need to support multiple TrackingIDs in the same query:

CREATE TABLE #foo
(
    TrackingID INT,
    [Status] VARCHAR(32),
    StatusDate SMALLDATETIME
);

INSERT #foo SELECT 1, 'PickedUp',  '2010-10-01 08:15';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 03:07';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 10:28';
INSERT #foo SELECT 1, 'Delayed',   '2010-10-03 09:52';
INSERT #foo SELECT 1, 'InTransit', '2010-10-03 20:09';
INSERT #foo SELECT 1, 'AtDest',    '2010-10-04 13:42';
INSERT #foo SELECT 1, 'Deliv',     '2010-10-04 17:05';
INSERT #foo SELECT 2, 'InTransit', '2010-10-02 10:28';
INSERT #foo SELECT 2, 'Delayed',   '2010-10-03 09:52';
INSERT #foo SELECT 2, 'InTransit', '2010-10-03 20:09';
INSERT #foo SELECT 2, 'AtDest',    '2010-10-04 13:42';

WITH src AS
(
    SELECT 
        TrackingID,
        [Status],
        StatusDate, 
        ab = ROW_NUMBER() OVER (ORDER BY [StatusDate])
    FROM #foo
),
realsrc AS
(
    SELECT 
        a.TrackingID,
        leftrow         = a.ab,
        rightrow        = b.ab,
        leftstatus      = a.[Status],
        leftstatusdate  = a.StatusDate,
        rightstatus     = b.[Status],
        rightstatusdate = b.StatusDate 
    FROM src AS a
    LEFT OUTER JOIN src AS b
    ON a.ab = b.ab - 1
    AND a.TrackingID = b.TrackingID
)
SELECT 
    Id = ROW_NUMBER() OVER (ORDER BY TrackingID, [leftstatusdate]),
    TrackingID,
    [Status] = leftstatus,
    [StatusDate] = leftstatusdate
FROM
    realsrc
WHERE
    rightrow IS NULL
    OR (leftrow = rightrow - 1 AND leftstatus <> rightstatus)
ORDER BY 
    TrackingID, 
    [StatusDate];
GO
DROP TABLE #foo;
Aaron Bertrand
@Aaron this works for half of my problem. You were able to filter out sequential statuses from the source, but it ultimately doesn't address any issues I would have with the target. As I said in my orig question in step 2 I need to worry about inserting a duplicate sequential status in my target as well. This doesn't assure me that this wouldn't happen. I am giving you a +1 for an elegant solution in retrieving the source data.
joshlrogers
A: 

If this is SQL 2005 then you can use ROW_NUMBER with a sub query or CTE: If the dataset is really huge though and performance is an issue then one of the above that got pasted while I was trying to get the code block to work could well be more efficient.

/**
*  This is just to create a sample table to use in the test query
**/

DECLARE @test TABLE(ID INT, TrackingID INT, Status VARCHAR(20), StatusDate DATETIME)
INSERT    @test
SELECT    1,1,'PickedUp', '01 jan 2010 08:00' UNION
SELECT    2,1,'InTransit', '01 jan 2010 08:01' UNION
SELECT    3,1,'InTransit', '01 jan 2010 08:02' UNION
SELECT    4,1,'Delayed', '01 jan 2010 08:03' UNION
SELECT    5,1,'InTransit', '01 jan 2010 08:04' UNION
SELECT    6,1,'AtDest', '01 jan 2010 08:05' UNION
SELECT    7,1,'Deliv', '01 jan 2010 08:06'


/**
*  This would be the select code to exclude the duplicate entries. 
*  Sorting desc in row_number would get latest instead of first
**/
;WITH n AS
(
    SELECT    ID,
            TrackingID,
            Status,
            StatusDate,
            --For each Status for a tracking ID number by ID (could use date but 2 may be the same)
            ROW_NUMBER() OVER(PARTITION BY TrackingID, Status ORDER BY ID) AS [StatusNumber]
    FROM    @test
)
SELECT    ID,
        TrackingID,
        Status,
        StatusDate
FROM    n
WHERE    StatusNumber = 1
ORDER    BY ID
Bodestone
You query does not return entry `5` from the original table.
Quassnoi
+2  A: 

Here's a solution without self-joins:

WITH    q AS
        (
        SELECT  *,
                ROW_NUMBER() OVER (ORDER BY statusDate) AS rn,
                ROW_NUMBER() OVER (PARTITION BY status ORDER BY statusDate) AS rns
        FROM    tracking
        WHERE   tackingId = @id
        ),
        qs AS
        (
        SELECT  *,
                ROW_NUMBER() OVER (PARTITION BY rn - rns ORDER BY statusDate) AS rnn
        FROM    q
        )
SELECT  *
FROM    qs
WHERE   rnn = 1
ORDER BY
        statusDate

Here's a script to check:

DECLARE @tracking TABLE
        (
        id INT NOT NULL PRIMARY KEY,
        trackingId INT NOT NULL,
        status INT,
        statusDate DATETIME
        )

INSERT
INTO    @tracking
SELECT  1, 1, 1, DATEADD(d, 1, '2010-01-01')
UNION ALL
SELECT  2, 1, 2, DATEADD(d, 2, '2010-01-01')
UNION ALL
SELECT  3, 1, 2, DATEADD(d, 3, '2010-01-01')
UNION ALL
SELECT  4, 1, 2, DATEADD(d, 4, '2010-01-01')
UNION ALL
SELECT  5, 1, 3, DATEADD(d, 5, '2010-01-01')
UNION ALL
SELECT  6, 1, 3, DATEADD(d, 6, '2010-01-01')
UNION ALL
SELECT  7, 1, 4, DATEADD(d, 7, '2010-01-01')
UNION ALL
SELECT  8, 1, 2, DATEADD(d, 8, '2010-01-01')
UNION ALL
SELECT  9, 1, 2, DATEADD(d, 9, '2010-01-01')
UNION ALL
SELECT  10, 1, 1, DATEADD(d, 10, '2010-01-01')
;
WITH    q AS
        (
        SELECT  *,
                ROW_NUMBER() OVER (ORDER BY statusDate) AS rn,
                ROW_NUMBER() OVER (PARTITION BY status ORDER BY statusDate) AS rns
        FROM    @tracking
        ),
        qs AS
        (
        SELECT  *,
                ROW_NUMBER() OVER (PARTITION BY rn - rns ORDER BY statusDate) AS rnn
        FROM    q
        )
SELECT  *
FROM    qs
WHERE   rnn = 1
ORDER BY
        statusDate
Quassnoi
@Quassnoi - Do you think this will be equivalent in performance to a self-join, or would you expect it to run faster?
JNK
+1 - I can't get mine to work!
JNK
A: 

I think this example will do what you're looking for:

CREATE TABLE dbo.srcStatus (
 Id INT IDENTITY(1,1),
 TrackingId INT NOT NULL,
 [Status] VARCHAR(10) NOT NULL,
 StatusDate DATETIME NOT NULL
);

CREATE TABLE dbo.tgtStatus (
 Id INT IDENTITY(1,1),
 TrackingId INT NOT NULL,
 [Status] VARCHAR(10) NOT NULL,
 StatusDate DATETIME NOT NULL
);

INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'PickedUp','10/1/2010 8:15 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'InTransit','10/2/2010 3:07 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'InTransit','10/2/2010 10:28 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'PickedUp','10/1/2010 8:15 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'InTransit','10/2/2010 3:07 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'Delayed','10/2/2010 10:28 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'Delayed','10/3/2010 9:52 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'InTransit','10/3/2010 8:09 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'AtDest','10/4/2010 1:42 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'Deliv','10/4/2010 5:05 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'InTransit','10/3/2010 9:52 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'InTransit','10/3/2010 8:09 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'AtDest','10/4/2010 1:42 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'Deliv','10/4/2010 5:05 PM');

WITH    cteSrcTrackingIds
          AS ( SELECT DISTINCT
                        TrackingId
               FROM     dbo.srcStatus
             ),
        cteAllTrackingIds
          AS ( SELECT   TrackingId ,
                        [Status] ,
                        StatusDate
               FROM     dbo.srcStatus
               UNION
               SELECT   tgtStatus.TrackingId ,
                        tgtStatuS.[Status] ,
                        tgtStatus.StatusDate
               FROM     cteSrcTrackingIds
                        INNER JOIN dbo.tgtStatus ON cteSrcTrackingIds.TrackingId = tgtStatus.TrackingId
             ),
        cteAllTrackingIdsWithRownums
          AS ( SELECT   TrackingId ,
                        [Status] ,
                        StatusDate ,
                        ROW_NUMBER() OVER ( PARTITION BY TrackingId ORDER BY StatusDate ) AS rownum
               FROM     cteAllTrackingIds
             ),
        cteTrackingIdsWorkingSet
          AS ( SELECT   src.rownum AS [id] ,
                        src2.rownum AS [id2] ,
                        src.TrackingId ,
                        src.[Status] ,
                        src.StatusDate ,
                        ROW_NUMBER() OVER ( PARTITION BY src.TrackingId,
                                            src.rownum ORDER BY src.StatusDate ) AS rownum
               FROM     cteAllTrackingIdsWithRownums AS [src]
                        LEFT OUTER JOIN cteAllTrackingIdsWithRownums AS [src2] ON src.TrackingId = src2.TrackingId
                                                              AND src.rownum < src2.rownum
                                                              AND src.[Status] != src2.[Status]
             ),
        cteTrackingIdsSubset
          AS ( SELECT   id ,
                        TrackingId ,
                        [Status] ,
                        StatusDate ,
                        ROW_NUMBER() OVER ( PARTITION BY TrackingId, id2 ORDER BY id ) AS rownum
               FROM     cteTrackingIdsWorkingSet
               WHERE    rownum = 1
             )
    INSERT  INTO dbo.tgtStatus
            ( TrackingId ,
              [status] ,
              StatusDate
            )
            SELECT  cteTrackingIdsSubset.TrackingId ,
                    cteTrackingIdsSubset.[status] ,
                    cteTrackingIdsSubset.StatusDate
            FROM    cteTrackingIdsSubset
                    LEFT OUTER JOIN dbo.tgtStatus ON cteTrackingIdsSubset.TrackingId = tgtStatus.TrackingId
                                                     AND cteTrackingIdsSubset.[status] = tgtStatus.[status]
                                                     AND cteTrackingIdsSubset.StatusDate = tgtStatus.StatusDate
            WHERE   cteTrackingIdsSubset.rownum = 1
                    AND tgtStatus.id IS NULL
            ORDER BY cteTrackingIdsSubset.TrackingId ,
                    cteTrackingIdsSubset.StatusDate;
SQLDBA
+1  A: 

Here you go. I'll let you clean it up and do optimizations. one of the sub queries can go into a view and the messy date comparison can be cleaned up. If you're using SQL 2008 R2 then use CAST as DATE instead.

    declare @tbl1 table(
id int, Trackingid int, Status varchar(50), StatusDate datetime
)

declare @tbl2 table(
id int, Trackingid int, Status varchar(50), StatusDate datetime
)

----Source data
insert into @tbl1 (id, trackingid, status, statusdate) values(1,1,'PickedUp','10/01/10  1:00') --
insert into @tbl1 (id, trackingid, status, statusdate) values(2,1,'InTransit','10/02/10 1:00') --
insert into @tbl1 (id, trackingid, status, statusdate) values(8,1,'InTransit','10/02/10  3:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(4,1,'Delayed','10/03/10 1:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(5,1,'InTransit','10/03/10 1:01')
insert into @tbl1 (id, trackingid, status, statusdate) values(6,1,'AtDest','10/03/10 2:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(7,1,'Deliv','10/03/10 3:00') --
insert into @tbl1 (id, trackingid, status, statusdate) values(3,2,'InTransit','10/03/10 1:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(9,2,'AtDest','10/04/10 1:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(10,2,'Deliv','10/04/10 1:05')
insert into @tbl1 (id, trackingid, status, statusdate) values(11,1,'Delayed','10/02/10 2:05')

----Target data
insert into @tbl2 (id, trackingid, status, statusdate) values(1,1,'PickedUp','10/01/10  1:00')
insert into @tbl2 (id, trackingid, status, statusdate) values(2,1,'InTransit','10/02/10 1:00')
insert into @tbl2 (id, trackingid, status, statusdate) values(3,1,'Deliv','10/03/10 3:00')


select d.* from
(
    select 
    * ,
    ROW_NUMBER() OVER(PARTITION BY trackingid, CAST((STR( YEAR( statusdate ) ) + '/' +STR( MONTH(statusdate ) ) + '/' +STR( DAY( statusdate ) )) AS DATETIME) ORDER BY statusdate) AS 'RN'
    from @tbl1
) d

where 
not exists
(
    select RN from
    (
        select 
        * ,
        ROW_NUMBER() OVER(PARTITION BY trackingid, CAST((STR( YEAR( statusdate ) ) + '/' +STR( MONTH(statusdate ) ) + '/' +STR( DAY( statusdate ) )) AS DATETIME) ORDER BY statusdate) AS 'RN'
        from @tbl1
    )f where f.RN = d.RN + 1 and d.status = f.status and f.trackingid = d.trackingid and 
    CAST((STR( YEAR( f.statusdate ) ) + '/' +STR( MONTH(f.statusdate ) ) + '/' +STR( DAY( f.statusdate ) )) AS DATETIME) =
            CAST((STR( YEAR( d.statusdate ) ) + '/' +STR( MONTH(d.statusdate ) ) + '/' +STR( DAY( d.statusdate ) )) AS DATETIME)
)

and
not exists 
(
    select 1 from @tbl2 t2
    where (t2.trackingid = d.trackingid
    and t2.statusdate = d.statusdate
    and t2.status = d.status)
)
and (
    not exists
    (
        select 1 from
        (
            select top 1 * from @tbl2 t2 
            where t2.trackingid = d.trackingid
            order by t2.statusdate desc
        ) g
        where g.status = d.status
    )
    or not exists
    (
        select 1 from
        (
            select top 1 * from @tbl2 t2 
            where t2.trackingid = d.trackingid
            and t2.statusdate <= d.statusdate
            order by t2.statusdate desc
        ) g
        where g.status = d.status
    )
)
order by trackingid,statusdate
DustinDavis
This achieves everything I needed, thanks so much!
joshlrogers