Rows where LineNumber is null are ignored. How should IsRepeat be handled in that case?
It works on the test data; whether it will be efficient enough for production volumes is an open question.
Where several rows share the same (LineNumber, CreatedOn) pair, one is chosen arbitrarily (the one with the minimum JobId).
Basic idea:
- Get all JobId pairs that are more than seven days apart, by LineNumber.
- For each pair, count the rows that are more than seven days after the left side, up to and including the right side. (Cnt)
- Then we know that if JobId X is not a repeat, the next non-repeat is the right side of the pair with X on the left side and Cnt = 1.
- Use a recursive CTE that starts with the first row for each LineNumber.
- The recursive element uses the pairs with counts to get the next row.
- Finally, update, setting IsRepeat to 0 for all non-repeats and 1 for everything else.
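For a concrete picture of the rules above, here is a minimal sketch of test data. Only the column names come from the queries; the data types and the rows themselves are assumptions made up for illustration.

```sql
-- Assumed schema: column names are taken from the queries, types are guesses.
create table Job
    ( JobId      int      not null primary key
    , LineNumber int      null
    , CreatedOn  datetime not null
    , IsRepeat   bit      null
    );

insert into Job (JobId, LineNumber, CreatedOn) values
      (1, 1,    '20240101')  -- first row for line 1
    , (2, 1,    '20240105')  -- within seven days of job 1
    , (3, 1,    '20240109')  -- first row more than seven days after job 1
    , (4, 1,    '20240110')  -- within seven days of job 3
    , (5, 1,    '20240120')  -- first row more than seven days after job 3
    , (6, null, '20240102'); -- null LineNumber, not touched by the update
```

Walking line 1 by hand: job 1 is the first row, job 3 is the first row more than seven days after job 1, and job 5 is the first row more than seven days after job 3. So jobs 1, 3 and 5 should end up with IsRepeat = 0, jobs 2 and 4 with IsRepeat = 1, and job 6 keeps a null IsRepeat.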
    -- All pairs on the same LineNumber whose right side is more than
    -- seven days after the left side.
    ; with AllPairsByLineNumberAtLeast7DaysApart
        (LineNumber, LeftJobId, RightJobId, BeginCreatedOn, EndCreatedOn) as
    (select l.LineNumber
          , l.JobId
          , r.JobId
          , dateadd(day, 7, l.CreatedOn)
          , r.CreatedOn
       from Job l
      inner join Job r
         on l.LineNumber = r.LineNumber
        and dateadd(day, 7, l.CreatedOn) < r.CreatedOn
        and l.JobId <> r.JobId)
    -- Count the rows after BeginCreatedOn, up to and including EndCreatedOn.
    -- BeginCreatedOn is compared strictly, matching the pair condition above;
    -- otherwise a row exactly seven days after the left side would be counted
    -- and no pair for that left side would have Cnt = 1.
    -- When CreatedOn = EndCreatedOn, include only JobId <= RightJobId,
    -- to handle ties in CreatedOn.
    , AllPairsCount (LineNumber, LeftJobId, RightJobId, Cnt) as
    (select ap.LineNumber, ap.LeftJobId, ap.RightJobId, count(*)
       from AllPairsByLineNumberAtLeast7DaysApart ap
      inner join Job j
         on j.LineNumber = ap.LineNumber
        and ap.BeginCreatedOn < j.CreatedOn
        and (j.CreatedOn < ap.EndCreatedOn
             or (j.CreatedOn = ap.EndCreatedOn
                 and j.JobId <= ap.RightJobId))
      group by ap.LineNumber, ap.LeftJobId, ap.RightJobId)
    -- Number the rows within each LineNumber in (CreatedOn, JobId) order.
    , Step1 (LineNumber, JobId, CreatedOn, RN) as
    (select LineNumber, JobId, CreatedOn
          , row_number() over
                (partition by LineNumber order by CreatedOn, JobId)
       from Job)
    , Results (JobId, LineNumber, CreatedOn) as
    -- Start with the first row for each LineNumber.
    (select JobId, LineNumber, CreatedOn
       from Step1
      where RN = 1
        and LineNumber is not null
    -- Get the next non-repeat row.
      union all
     select j.JobId, j.LineNumber, j.CreatedOn
       from Results r
      inner join AllPairsCount apc on apc.LeftJobId = r.JobId
      inner join Job j
         on j.JobId = apc.RightJobId
        and apc.Cnt = 1)
    update j
       set IsRepeat = case when r.JobId is not null then 0 else 1 end
      from Job j
      left outer join Results r
        on j.JobId = r.JobId
     where j.LineNumber is not null
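One thing to watch at production volumes: SQL Server stops a recursive CTE after 100 recursion levels by default, and each non-repeat found on a line adds a level. If a single LineNumber can have a chain of more than about 100 non-repeats, a MAXRECURSION hint would be needed at the very end of the update statement. A self-contained sketch of where the hint goes, using a hypothetical counter CTE (Numbers) rather than the Job table:

```sql
-- Hypothetical counter CTE, only to show the placement of the hint.
; with Numbers (N) as
    (select 1
      union all
     select N + 1
       from Numbers
      where N < 500)
select count(*) as Levels
  from Numbers
option (maxrecursion 0);  -- 0 removes the limit; without the hint this
                          -- statement fails once it passes 100 levels
```

On the update, the same `option (maxrecursion 0)` clause would follow the final `where j.LineNumber is not null`. Whether removing the limit entirely is acceptable, or a specific ceiling is safer, depends on the data.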
EDIT:
After I turned off the computer last night I realized I had made things more complicated than they needed to be. A more straightforward (and, on the test data, slightly more efficient) query:
Basic idea:
- Generate PotentialSteps (FromJobId, ToJobId). These are the pairs where, if FromJobId is not a repeat, then ToJobId is also not a repeat. (ToJobId is the first row, on the same LineNumber, more than seven days after FromJobId.)
- Use a recursive CTE to start from the first JobId for each LineNumber and then step, using PotentialSteps, to each non-repeating JobId.
    ; with PotentialSteps (FromJobId, ToJobId) as
    (select FromJobId, ToJobId
       from (select f.JobId as FromJobId
                  , t.JobId as ToJobId
                  -- Number the candidates per from-row, so that every FromJobId
                  -- keeps its own first row more than seven days later.
                  , row_number() over
                        (partition by f.JobId order by t.CreatedOn, t.JobId) as RN
               from Job f
              inner join Job t
                 on f.LineNumber = t.LineNumber
                and dateadd(day, 7, f.CreatedOn) < t.CreatedOn) t
      where RN = 1)
    , NonRepeats (JobId) as
    -- Start with the first row for each LineNumber.
    (select JobId
       from (select JobId
                  , row_number() over
                        (partition by LineNumber order by CreatedOn, JobId) as RN
               from Job) Start
      where RN = 1
    -- Step to the next non-repeat.
      union all
     select J.JobId
       from NonRepeats NR
      inner join PotentialSteps PS
         on NR.JobId = PS.FromJobId
      inner join Job J
         on PS.ToJobId = J.JobId)
    update J
       set IsRepeat = case when NR.JobId is not null then 0 else 1 end
      from Job J
      left outer join NonRepeats NR
        on J.JobId = NR.JobId
     where J.LineNumber is not null
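A quick sanity check that can be run after either update, offered as a sketch rather than a full test: it lists every non-repeat whose previous non-repeat on the same LineNumber is seven days or less before it. It assumes IsRepeat has already been set and uses only the columns from the queries above. An empty result is expected; any row returned points at a chain that stepped too early.

```sql
-- Non-repeats that are NOT more than seven days after the previous
-- non-repeat on the same LineNumber. Should return no rows.
select cur.JobId
     , cur.LineNumber
     , cur.CreatedOn
     , prev.JobId     as PrevJobId
     , prev.CreatedOn as PrevCreatedOn
  from Job cur
 cross apply (select top (1) p.JobId, p.CreatedOn
                from Job p
               where p.LineNumber = cur.LineNumber
                 and p.IsRepeat = 0
                 and (p.CreatedOn < cur.CreatedOn
                      or (p.CreatedOn = cur.CreatedOn
                          and p.JobId < cur.JobId))
               order by p.CreatedOn desc, p.JobId desc) prev
 where cur.IsRepeat = 0
   and dateadd(day, 7, prev.CreatedOn) >= cur.CreatedOn;
```

This checks only a necessary condition. It does not confirm that every row marked as a repeat really falls within seven days of the preceding non-repeat, so it complements a hand-checked test set rather than replacing one.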