views:

231

answers:

4

I have two tables: a schedule table that contains information about how an employee is scheduled and a numbers table in which each number corresponds to a date.

The tables look like:

[Employee Schedule]

ID          Employee ID Project ID  Day ID
----------- ----------- ----------- -----------
1           64          2           168
2           64          2           169
3           64          2           170
4           64          2           171
5           64          1           169
6           64          1           170
7           64          1           171
8           64          1           172
9           64          2           182
10          64          2           183
11          64          2           184

and

[Day Numbers]

ID          Day
----------- ----------
168         2009-06-18
169         2009-06-19
170         2009-06-20
171         2009-06-21
172         2009-06-22
173         2009-06-23
174         2009-06-24
175         2009-06-25
176         2009-06-26
177         2009-06-27
178         2009-06-28
179         2009-06-29
180         2009-06-30
181         2009-07-01
182         2009-07-02
183         2009-07-03
184         2009-07-04

As you can see, Employee 64 is scheduled on project 1 from 2009-06-19 to 2009-06-22 and project 2 from 2009-06-18 to 2009-06-21 and again from 2009-07-02 to 2009-07-04.

My question is: what algorithm can I use to quickly determine the spans of the employee's schedule in a fashion such that I can display it as follows?

 Employee ID Project ID Duration
 ----------- ---------- ------------
 64          1          2009-06-19 to 2009-06-22
 64          2          2009-06-18 to 2009-06-21
 64          2          2009-07-02 to 2009-07-04

I can do this on the SQL side or the code side. I have Linq at my disposal if I need it. The table doesn't need to be compiled by SQL. This will happen dynamically on a website and should be as efficient as possible. I don't want to have to iterate through each and look for breaks in contiguous days if I don't have to.

A: 

I haven't tested, but try:

select [Employee ID], [Project ID], start + ' to ' + end
from (
    select s.[Employee ID], s.[Project ID], min(d.Day) start, max(d.Day) end
    from [Employee Schedule] s
    inner join [Day Numbers] d on s.[Day ID] = d.[Day ID]
    group by s.[Employee ID], s.[Project ID]
) a

Edit: corrected some column names

For easier querying, I recommend you refactor the schema to:

[EmployeeSchedule]

ID 
EmployeeID 
ProjectID  
StartDate 
EndDate

and get rid of Day Numbers completely. That will make your queries simpler, more efficient, and will allow you to have records with NULL StartDates or EndDates if you wish.

RedFilter
Does this deal with the 2 ranges for projectid = 2?
gbn
@gbn. Yes it does.
Triptych
That won't work. There may be gaps between the min and max that shouldn't be included in the duration.
hypoxide
End of year will be interesting!
n8wrl
@n8wrl - Luckily there's a numbers table so it won't matter. :)
hypoxide
That's true, it will not handle multiple entries for same employee/project. Can I suggest you refactor your schema to have StartDay and EndDay columns in your [Employee Schedule] table? A query can be written to work with the existing schema, but it is more work and less efficient than if you change the schema.
RedFilter
A schema refactor would hinder my ability to find gaps and overlaps in employee scheduling, which is why I'm using a numbers table for the dates.
hypoxide
How so? You could do: /* gaps */select s.* from [Employee Schedule] s inner join [Day Numbers] d on d.[Day ID] not between s.StartDay and s.EndDay/* overlaps */select * from [Employee Schedule] s inner join [Employee Schedule] s2 on (s2.StartDay between s.StartDay and s.EndDay or s2.EndDay between s.StartDay and s.EndDay or s.StartDay between s2.StartDay and s2.EndDay or s.EndDay between s2.StartDay and s2.EndDay) and s.ID <> s2.ID
RedFilter
+1  A: 

Lets make a view to make things easier:

create view EmployeeProjectDates
as
select
    e.[Employee ID], e.[Project ID], d.Day
from
    [Employee Scchedule] e
    join [Day Numbers] d on e.[Day Id] = d.ID

You can do a query like this to get all the start dates:

select
    one.[Employee ID], one.[Project ID], one.Day as StartDate
from
    EmployeeProjectDays one
    left join EmployeeProjectDays two on one.[Employee ID] = two.[Employee ID] and one.[Project ID] = two.[Project ID] and one.Day = DATEADD(DAY, 1, two.Day)
where
    two.Day is null

And then do a similar query to get the end dates and match them up. I think that something like this would get you both.

select
    one.[Employee ID], one.[Project ID], one.Day as StartDate,
    (select
     min(two_end.Day)
 from
     EmployeeProjectDays one_end
     join EmployeeProjectDays two_end on one_end.[Employee ID] = two_end.[Employee ID] and one_end.[Project ID] = two_end.[Project ID] and one.Day = DATEADD(DAY, 1, two.Day)
     where
     one_end.Day is null
     and two_end.Day > one.Day) as EndDate
from
    EmployeeProjectDays one
    left join EmployeeProjectDays two on one.[Employee ID] = two.[Employee ID] and one.[Project ID] = two.[Project ID] and one.Day = DATEADD(DAY, 1, two.Day)
where
    two.Day is null

I haven't tested any of these queries, but something similar should work. I had to use a similar query before we implemented something in our application code to find the start and end dates.

Adam Hughes
I tried your code and I got null values for all end dates. I don't quite understand the end-date nested join well enough to debug it.
hypoxide
The subquery for the End Date is basically the same as the query for the start date, just reversed. It should give you a list of all the end dates, but for each row you only want the first end date that is after the start date.The subquery probably also needs to make sure the Employee ID and Project ID match the values from the main query, which I neglected.
Adam Hughes
+3  A: 

Assuming the Day IDs are always sequential for a partial solution...

select *
  from employee_schedule a                    
 where not exists( select *                          
                     from employee_schedule b        
                    where a.employeeid = b.employeeid
                      and a.projectid  = b.projectid 
                      and (a.dayid - 1) = b.dayid )

lists the start day IDs:

 ID      EMPLOYEEID       PROJECTID           DAYID 
 1              64               2             168 
 5              64               1             169 
 9              64               2             182 



select *
  from employee_schedule a                   
 where not exists( select *                         
                     from employee_schedule b       
                    where a.employeeid = b.employeei
                      and a.projectid  = b.projectid
                      and (a.dayid + 1) = b.dayid )

lists the end day IDs:

  ID      EMPLOYEEID       PROJECTID           DAYID 
  4              64               2             171 
  8              64               1             172 
 11              64               2             184
Paul Morgan
even if they are not sequential, the query can be rewritten using actual dates with DATEADD function and the same comparison
van
+1  A: 

This one works with oracle, and starting from that it should be possible in SQL Server as well. (including testscript)

create table schedule (id number, employee_id number, project_id number, day_id number);

insert into schedule (id, employee_id, project_id, day_id)
values(1,64,2,168);
insert into schedule (id, employee_id, project_id, day_id)
values(2,64,2,169);
insert into schedule (id, employee_id, project_id, day_id)
values(3,64,2,170);
insert into schedule (id, employee_id, project_id, day_id)
values(4,64,2,171);
insert into schedule (id, employee_id, project_id, day_id)
values(5,64,1,169);
insert into schedule (id, employee_id, project_id, day_id)
values(6,64,1,170);
insert into schedule (id, employee_id, project_id, day_id)
values(7,64,1,171);
insert into schedule (id, employee_id, project_id, day_id)
values(8,64,1,172);
insert into schedule (id, employee_id, project_id, day_id)
values(9,64,2,182);
insert into schedule (id, employee_id, project_id, day_id)
values(10,64,2,183);
insert into schedule (id, employee_id, project_id, day_id)
values(11,64,2,184);
insert into schedule (id, employee_id, project_id, day_id)
values(11,65,3,184);

select * 
FROM (
    select  
        employee_id,
        project_id,
        first_day,
        nvl(last_day, 
            lead(last_day) over (
                partition by employee_id, project_id 
                order by nvl(first_day, last_day)
            )
        ) last_day
    from (
        select -- this identifies start and end rows of an interval
            employee_id,
            project_id,
            decode (day_id - prev_day, 1, null, day_id) first_day, -- uses day_id, if prev_day is not really the previous day, i.e. a gap or null
            decode (day_id - next_day, -1, null, day_id) last_day
        from (
            select -- this select adds columns for the previous and next day, in order to identify the boundaries of intervals 
                employee_id, 
                project_id, 
                day_id, 
                lead(day_id) over ( 
                    partition by employee_id, project_id 
                    order by day_id
                ) next_day,
                lag(day_id) over ( 
                    partition by employee_id, project_id 
                    order by day_id
                ) prev_day
            from schedule
        )
    )
    where first_day is not null 
    or last_day is not null-- just filter the rows, that represent start or end dates
) 
where first_day is not null

produces this output:

64  1 169 172
64  2 168 171
64  2 182 184
65  3 184 184
Jens Schauder
That is a beautiful solution. If only there were lag and lead functions in T-SQL.
hypoxide