views:

709

answers:

3
A: 

I'm not proud of this, it makes many assumptions (e.g. that CreatedOn is date only, and (LineNUmber,CreatedOn) is a key. Some tuning may be required, only works with test data.

In other words, I created this more for intellectual curiosity rather than because I think it's a genuine solution. Final select could be an update to set IsRepeat in the base table, based on existence on rows in V4. Final note before letting people see evil - could people please post test data in comments for data sets that it doesn't work for. It might be possible to turn this into a real solution:

with V1 as (
select t1.LineNumber,t1.CreatedOn,t2.CreatedOn as PrevDate from
T1 t1 inner join T1 t2 on t1.LineNumber = t2.LineNumber and t1.CreatedOn > t2.CreatedOn and DATEDIFF(DAY,t2.CreatedOn,t1.CreatedOn) < 7
), V2 as (
select v1.LineNumber,v1.CreatedOn,V1.PrevDate from V1
union all
select v1.LineNumber,v1.CreatedOn,v2.PrevDate from v1 inner join v2 on V1.LineNumber = v2.LineNumber and v1.PrevDate = v2.CreatedOn
), V3 as (
select LineNumber,CreatedOn,MIN(PrevDate) as PrevDate from V2 group by LineNumber,CreatedOn
), V4 as (
select LineNumber,CreatedOn from V3 where DATEDIFF(DAY,PrevDate,CreatedOn) < 7
)
select
    T1.LineNumber,
    T1.CreatedOn,
    CASE WHEN V4.LineNumber is Null then 0 else 1 end as IsRepeat
from
    T1
     left join
    V4
     on
      T1.LineNumber = V4.LineNumber and
      T1.CreatedOn = V4.CreatedOn
order by T1.CreatedOn,T1.LineNumber
option (maxrecursion 7)
Damien_The_Unbeliever
LineNumber is not part of the Primary KeyCreatedOn has a time component and is essentially smalldatetimeI'll post some data for you very shortlyI'm not sure how your query works yet, but on a very limited set of data I had cooked up it seems to work :)
BlackMael
-1. This returns all rows for a LineNumber > seven days from the first as being a non-repeat. It doesn't handle repeats after the second non repeat row. Look at LineNumber 1005 in the test data posted by OP. Only created on 2009-07-01 10:51:00 and 2009-07-09 13:10:00 should be IsRepeate=False. You have all rows >= 2009-07-08 09:47:00 (Of course, the 07-08 comes from doing whole days, but you should not be getting every date since than as a non-repeat, unless I have misunderstood OP.)
Shannon Severance
This was posted before BlackMaels update, when the only test data was the 6 line table at the top of the post. Based on that data, it returned the correct result set.
Damien_The_Unbeliever
+1  A: 

Ignores LineNumber is null. How should IsRepeat be handled in that case?

It works for test data. Whether it will be efficient enough for production volumes?

In the case of duplicate (LineNumber, CreatedOn) on pairs, arbitrarily choose one. (The one with minimum JobId)

Basic idea:

  1. Get all JobId pairs that are at least seven days apart, by line number.
  2. Count the number of rows that are more than seven days from the left side, upto and including the right side. (CNT)
  3. Then we know if JobId x is not a repeat, the next not a repeat is the pair with X on the left side, and CNT = 1
  4. Use recursive CTE to start with the first row for each LineNumber
  5. Recursive element uses the pair with counts to get the next row.
  6. Finally update, setting all IsRepeat to 0 for non-repeats and 1 for everything else.


; with AllPairsByLineNumberAtLeast7DaysApart (LineNumber
            , LeftJobId
            , RightJobId
            , BeginCreatedOn
            , EndCreatedOn) as
        (select l.LineNumber
            , l.JobId
            , r.JobId
            , dateadd(day, 7, l.CreatedOn)
            , r.CreatedOn
        from Job l
        inner join Job r
            on l.LineNumber = r.LineNumber
            and dateadd(day, 7, l.CreatedOn) < r.CreatedOn
            and l.JobId <> r.JobId)
    -- Count the number of rows within from BeginCreatedOn 
    -- up to and including EndCreatedOn
    -- In the case of CreatedOn = EndCreatedOn, 
    -- include only jobId <= jobid, to handle ties in CreatedOn        
    , AllPairsCount(LineNumber, LeftJobId, RightJobId, Cnt) as
        (select ap.LineNumber, ap.LeftJobId, ap.RightJobId, count(*)
        from AllPairsByLineNumberAtLeast7DaysApart ap
        inner join Job j
            on j.LineNumber = ap.LineNumber
            and ap.BeginCreatedOn <= j.createdOn
            and (j.CreatedOn < ap.EndCreatedOn
                or (j.CreatedOn = ap.EndCreatedOn 
                    and j.JobId <= ap.RightJobId))
         group by ap.LineNumber, ap.LeftJobId, ap.RightJobId)
    , Step1 (LineNumber, JobId, CreatedOn, RN) as
        (select LineNumber, JobId, CreatedOn
            , row_number() over 
                (partition by LineNumber order by CreatedOn, JobId)
        from Job)
    , Results (JobId, LineNumber, CreatedOn) as    
        -- Start with the first rows.
        (select JobId, LineNumber, CreatedOn
        from Step1
        where RN = 1
        and LineNumber is not null
        -- get the next row
        union all
        select j.JobId, j.LineNumber, j.CreatedOn
        from Results r
        inner join AllPairsCount apc on apc.LeftJobId = r.JobId
        inner join Job j
            on j.JobId = apc.RightJobId
            and apc.CNT = 1)
    update j
    set IsRepeat = case when R.JobId is not null then 0 else 1 end
    from Job j
    left outer join Results r
        on j.JobId = R.JobId
    where j.LineNumber is not null

EDIT:

After I turned off the computer last night I realized I had made things more complicated than they needed to be. A more straightforward (and on the test data, slightly more effecient) query:

Basic Idea:

  1. Generated PotentialStep (FromJobId, ToJobId) These are the pairs where if FromJobId is not a repeat, than ToJobId is also not a repeat. (First row by LineNumber more than seven days from FromJobId)
  2. Use a recursive CTE to start from the first JobId for each LineNumber and then step, using PontentialSteps, to each Non Repeating JobId


; with PotentialSteps (FromJobId, ToJobId) as
    (select FromJobId, ToJobId
    from (select f.JobId as FromJobId
            , t.JobId as ToJobId
            , row_number() over
                 (partition by f.LineNumber order by t.CreatedOn, t.JobId) as RN
        from Job f
        inner join Job t
            on f.LineNumber = t.LineNumber
            and dateadd(day, 7, f.CreatedOn) < t.CreatedOn) t
        where RN = 1)
, NonRepeats (JobId) as
    (select JobId
    from (select JobId
            , row_number() over
                (partition by LineNumber order by CreatedOn, JobId) as RN
        from Job) Start
    where RN = 1
    union all
    select J.JobId
    from NonRepeats NR
    inner join PotentialSteps PS
        on NR.JobId = PS.FromJobId
    inner join Job J
        on PS.ToJobId = J.JobId)
update J
set IsRepeat = case when NR.JobId is not null then 0 else 1 end
from Job J
left outer join NonRepeats NR
on J.JobId = NR.JobId
where J.LineNumber is not null
Shannon Severance
Wow! I really must get on top of CTEs! Examples like this truly push my synapses while I get use to them. Looking forward to point this one through its paces.. :)
BlackMael
It also produces an intriguing execution plan... For those sad enough to be interested. Damn, I guess I'm sad enough since I looked already..
BlackMael
It ignores LineNumber IS NULL, but thats okay. I left IsRepeat for NULLs just in case I needed to care. On the most part I think I only really need to default to FALSE if LineNumber IS NULL
BlackMael
Shannon Severance
That's the plan once I manage to get my dev machine up and running so that I can try out your solutions
BlackMael
A: 
UPDATE Jobs 
SET Jobs.IsRepeat = 0 -- mark all of them IsRepeat = false

UPDATE Jobs 
SET Jobs.IsRepeat = 1
WHERE EXISTS 
   (SELECT TOP 1 i.LineNumber FROM Jobs i WHERE i.LineNumber = Jobs.LineNumber
    AND i.CreatedOn <> Jobs.CreatedOn and i.CreatedOn BETWEEN Jobs.CreatedOn - 7 
    AND Jobs.CreatedOn)

NOTE: I hope this helps you somewhat. Let me know, if you find any discrepancy that you will come across on a larger data set.

shahkalpesh
Sorry, this does not take into account that a Job is not a repeat if the only other job with the same line number within 7 days is a repeat.
BlackMael
@BlackMael: Could you show that with an example please?
shahkalpesh