views:

2622

answers:

11

I'm working on a query that needs to have some data rows combined based on date ranges. These rows are duplicated in all the data values, except the date ranges are split. For example the table data may look like

StudentID   StartDate EndDate  Field1  Field2
1    9/3/2007 10/20/2007 3  True
1    10/21/2007 6/12/2008 3  True
2    10/10/2007 3/20/2008 4  False
3    9/3/2007 11/3/2007 8  True
3    12/15/2007 6/12/2008 8  True

The result of the query should have the split date ranges combined. The query should combine date ranges with a gap of only one day. If there is more than a one day gap, then the rows shouldn't be combined. The rows that don't have a split date range should come through unchanged. The result would look like

StudentID   StartDate EndDate  Field1  Field2
1    9/3/2007 6/12/2008 3  True
2    10/10/2007 3/20/2008 4  False
3    9/3/2007 11/3/2007 8  True
3    12/15/2007 6/12/2008 8  True

What would be the SELECT statement for this query?

A: 

In my experience, I have to combine the ranges in post-processing (not in SQL but in my script). I'm not sure that a SQL can do this, particularly because you can never know exactly how many date ranges need to be chained in any particular case. If this can be done though, I'd love to know too.

EDIT: My answer is assuming that you have more than one range of dates per student, not just a start and an end. If you only have the one date range with no gaps, then the other mentioned solutions are the way to go.

Adam Bellaire
A: 

Select StudentID, min(startdate) as startdate, max(enddate), field1, field2 from tablex group by StudentID, field1, field2

That would yield you the result assuming the wasn't a gap between on student's time range.

Scott Bevington
A: 
select StudentID, min(StartDate) StartDate, max(EndDate) EndDate, Field1, Field2 
  from table
 group by StudentID, Field1, Field2
Joe Skora
A: 

If the min()/max() solutions are not good enough (e.g. if the dates are not contiguous and you want to group separate date ranges separately), I wonder if something using Oracle's START WITH and CONNECT BY clauses would work. Which, of course, wouldn't work on every database.

runrig
A: 

EDIT: Make another set of SQL for Access. I tested all of this, but piece by piece because I don't know how to make several statements at one time in Access. Since I also don't know how to do comments, you can see the comments in the SQL version, below.

select 
studentid, min(startdate) as Starter, max(enddate) as Ender, field1, field2, 
max(startDate) - Min(endDate)  as MaxGap 
into tempIDs
from student 
group by studentid, field1, field2 ;  

delete from tempIDs where MaxGap > 1;

UPDATE student INNER JOIN TempIDs ON Student.studentID = TempIDS.StudentID
SET Student.StartDate = [TempIDs].[Starter],
 Student.EndDate = [TempIDs].[Ender];

I think this is it, in SQL Server - I didn't do it in Access. I haven't tested it for fancy conditions such as overlapping several records, etc., but this should get you started. It updates all the duplicate, small-gap records, leaving extras in the database. MSDN has a page on eliminating duplicates: http://support.microsoft.com/kb/139444

select 
studentid, min(startdate) as StartDate, max(enddate) as EndDate, field1, field2, 
datediff(dd, Min(endDate),max(startDate)) as MaxGap 
into #tempIDs
from #student 
group by studentid, field1, field2    

-- Update the relevant records.  Keeps two copies of the massaged record 
-- - extra will need to be deleted.

update #student 
set startdate = #TempIDS.startdate, enddate = #tempIDS.EndDate
from #tempIDS 
where #student.studentid = #TempIDs.StudentID and MaxGap < 2
CindyH
I believe that this breaks down if a student has more than 2 rows in the table and two of the rows are adjacent.
Tom H.
Probably it breaks down somewhere, but this should be a good starting point.
CindyH
+1  A: 

The following code should work. I've made a few assumptions as follows: there are no overlaps of date ranges, there are no NULL values in any of the fields, and the start date for a given row is always less than the end date. If your data doesn't fit these criteria, you'll need to adjust this method, but it should point you in the right direction.

You can use subqueries instead of the views, but that can be cumbersome so I used the views to make the code clearer.

CREATE VIEW dbo.StudentStartDates
AS
    SELECT
     S.StudentID,
     S.StartDate,
     S.Field1,
     S.Field2
    FROM
     dbo.Students S
    LEFT OUTER JOIN dbo.Students PREV ON
     PREV.StudentID = S.StudentID AND
     PREV.Field1 = S.Field1 AND
     PREV.Field2 = S.Field2 AND
     PREV.EndDate = DATEADD(dy, -1, S.StartDate)
    WHERE PREV.StudentID IS NULL
GO

CREATE VIEW dbo.StudentEndDates
AS
    SELECT
     S.StudentID,
     S.EndDate,
     S.Field1,
     S.Field2
    FROM
     dbo.Students S
    LEFT OUTER JOIN dbo.Students NEXT ON
     NEXT.StudentID = S.StudentID AND
     NEXT.Field1 = S.Field1 AND
     NEXT.Field2 = S.Field2 AND
     NEXT.StartDate = DATEADD(dy, 1, S.EndDate)
    WHERE NEXT.StudentID IS NULL
GO


SELECT
    SD.StudentID,
    SD.StartDate,
    ED.EndDate,
    SD.Field1,
    SD.Field2
FROM
    dbo.StudentStartDates SD
INNER JOIN dbo.StudentEndDates ED ON
    ED.StudentID = SD.StudentID AND
    ED.Field1 = SD.Field1 AND
    ED.Field2 = SD.Field2 AND
    ED.EndDate > SD.StartDate AND
    NOT EXISTS (SELECT * FROM dbo.StudentEndDates ED2 WHERE ED2.StudentID = SD.StudentID AND ED2.Field1 = SD.Field1 AND ED2.Field2 = SD.Field2 AND ED2.EndDate < ED.EndDate AND ED2.EndDate > SD.StartDate)
GO
Tom H.
A: 

Have you considered a non-equi join? That would look something like this:

SELECT A.StudentID, A.StartDate, A.EndDate, A.Field1, A.Field2
FROM tblEnrollment AS A LEFT JOIN tblEnrollment AS B ON (A.StudentID = B.StudentID) 
   AND (A.EndDate=B.StartDate-1)
WHERE B.StudentID Is Null;

What that gives you is all the records that don't have a corresponing record that starts the day after the ending date of the first record.

[Caveat: Beware that you can only edit a non-equi join in the Access query designer in SQL View -- switching to Design View could cause the join to be lost (though if you do switch Access tells you about the problem, and if you immediately switch back to SQL View, you won't lose it)]

If you then UNION that with this:

SELECT A.StudentID, A.StartDate, B.EndDate, A.Field1, A.Field2
FROM tblEnrollment AS A INNER JOIN tblEnrollment AS B ON (A.StudentID = B.StudentID) 
   AND (A.EndDate= B.StartDate-1)

It should give you what you need, assuming there are never more than two contiguous records at a time. I'm not sure how you'd do it if you had more than two contiguous records (it might involve looking at StartDate-1 compared to EndDate), but this might get you started in the right direction.

--
David W. Fenton
David Fenton Associates

David-W-Fenton
A: 

An alternate final query to the one provided by Tom H. in the accepted answer is

SELECT
    SD.StudentID,
    SD.StartDate,
    MIN(ED.EndDate),
    SD.Field1,
    SD.Field2
FROM
    dbo.StudentStartDates SD
INNER JOIN dbo.StudentEndDates ED ON
    ED.StudentID = SD.StudentID AND
    ED.Field1 = SD.Field1 AND
    ED.Field2 = SD.Field2 AND
    ED.EndDate > SD.StartDate
GROUP BY
    SD.StudentID, SD.Field1, SD.Field2, SD.StartDate

This also worked on all test data.

Eric Ness
A: 

This is a classic problem in SQL (the language) e.g. covered in Joe Celko's books 'SQL for Smarties" (chapter 23, Regions, Runs, Gaps, Sequences and Series) and his latest book "Thinking in Sets" (chapter 15).

While it's 'fun' to fix the data at run time with a monster query, for me this is one of those situations that can be better fixed off line and procedurally (personally I'd do it with formulas in an Excel spreadsheet).

The important thing is to put in place effective database constraints to prevent the overlapping periods reoccurring. Again, writing sequenced constraints in SQL is a classic: see Snodgrass (http://www.cs.arizona.edu/people/rts/tdbbook.pdf). Hint for MS Access users: you'll need to use CHECK constraints.

onedaywhen
A: 

If the number of rows that need to be eliminated is guaranteed to be relatively small (i.e. less than 100 in each contiguous group), recursion is the way to go. I outlined an example here.

A: 

Heres an example with test data using SQL Server 2005/2008 syntax.

DECLARE @Data TABLE(
    CalendarDate datetime )

INSERT INTO @Data( CalendarDate )
-- range start
SELECT '1 Jan 2010'
UNION ALL SELECT '2 Jan 2010'
UNION ALL SELECT '3 Jan 2010'
-- range start
UNION ALL SELECT '5 Jan 2010'
-- range start
UNION ALL SELECT '7 Jan 2010'
UNION ALL SELECT '8 Jan 2010'
UNION ALL SELECT '9 Jan 2010'
UNION ALL SELECT '10 Jan 2010'

SELECT DateGroup, Min( CalendarDate ) AS StartDate, Max( CalendarDate ) AS EndDate
FROM(   SELECT NextDay.CalendarDate, 
            DateDiff( d, RangeStart.CalendarDate, NextDay.CalendarDate ) - ROW_NUMBER() OVER( ORDER BY NextDay.CalendarDate ) AS DateGroup
        FROM( SELECT Min( CalendarDate ) AS CalendarDate
                FROM @data ) AS RangeStart
            JOIN @data AS NextDay
                ON NextDay.CalendarDate >= RangeStart.CalendarDate ) A
GROUP BY DateGroup
Daniel P