views:

36

answers:

1

I have some rather dodgy data I'm looking at, and I've been tasked with finding out for how many consecutive months something has been going on for. However I have noitced that in some of my reference tables individual date values are being miscoded, and where there are multiple values in one month, the latest should be added into the next month.

For example:

Contact _id  | Original Payment Date 
id001        |  02/07/2003
id001        |      30/07/2003 --should be changed to 30/08/2003
id001        |      01/09/2003
id001        |      01/10/2003 
id001        |      30/10/2003 -- should be changed to 30/11/2003
id001        |      02/12/2003
id001        |      31/12/2003 -- should be changed to 31/01/2004
id001        |      30/01/2004
id001        |      03/03/2004

However there are two problems I'm finding with a simply DATEADD function:

1) If a payment needs to be moved into a following month and has a date greater than the next month allows (i.e. 31/01/2003 cannot go to 31/02/2003) I'm not sure how the dateadd would work in this case

2) If in the example above i make those changes, we have the following data:

id001        |      02/07/2003
id001        |      30/08/2003
id001        |      01/09/2003
id001        |      01/10/2003
id001        |      30/11/2003
id001        |      02/12/2003
id001        |      31/01/2004 --This should now be changed to a value in Februrary 2004 as there are
 now duplicates in January 2004 created by the previous amendment
 id001       |     30/01/2004
id001        |      03/03/2004

Although I believe that 2 'loops' of changes would make sure that all data is correct I cannot be sure, and so I would really like some way of pushing the latest date in a month (where there are 2 values for that month) forward a month that repeats until there are no more duplicates if possible.

I am using sql server 2005 and the reference table has around 20 million rows, so I'd prefer not to use cursors if possible :)

Thanks!

update The script I am using to update the dates the first time is this :

;WITH cte1 AS (  
SELECT  contact_id
        ,value_net
        ,DATEPART(YEAR, date_received)*12 + DATEPART(MONTH, date_received) -  
        DENSE_RANK() OVER  
                   (PARTITION BY contact_id  
                    ORDER BY DATEPART(YEAR, date_received)*12 + DATEPART(MONTH, date_received)) AS dategroup
        ,DENSE_RANK() OVER  
                   (PARTITION BY contact_id  
                    ORDER BY DATEPART(YEAR, date_received)*12 + DATEPART(MONTH, date_received)) AS rnk
        ,ROW_NUMBER() OVER  
                   (PARTITION BY contact_id  
                    ORDER BY DATEPART(YEAR, date_received)*12 + DATEPART(MONTH, date_received)) AS rnk2

        ,date_received 
FROM    donation with (nolock)
WHERE contact_id IS NOT NULL
 )
,cte2 AS
(
SELECT 
c1.contact_id
,c1.value_net
,c1.dategroup
,CASE WHEN c1.rnk = c2.rnk AND c1.rnk2 > c2.rnk2 THEN DATEADD(MM,+1,c1.date_received) ELSE c1.date_received END as date_received
from cte1 c1
LEFT OUTER JOIN cte1 c2 WITH (nolock) ON c2.contact_id = c1.contact_id AND c2.rnk = c1.rnk AND c2.rnk2 = c1.rnk2-1
)
+1  A: 

1) DATEADD() with MONTH of 1 to 2010-01-31 will result in 2010-02-28

2) A problem is going to be duplication and identifying a specific row within duplicates. For instance, you want to shift all but the first, but what if there are multiple first, they won't get shifted - i.e. you are doing something like this:

UPDATE dates
SET dt = DATEADD(MONTH, 1, dt)
WHERE YEAR(dt) * 100 + MONTH(dt) IN (
    SELECT YEAR(dt) * 100 + MONTH(dt)
    FROM dates
    GROUP BY YEAR(dt), MONTH(dt)
    HAVING COUNT(*) > 1
)
AND dt NOT IN (
    SELECT MIN(dt)
    FROM dates
    GROUP BY YEAR(dt), MONTH(dt)
    HAVING COUNT(*) > 1
)

You could probably modify this to use ROW_NUMBER() OVER() to identify rows uniquely.

Cade Roux
@Cade Thanks - I've now included some sample code to show how I am defining which row should have a month added, which works fine (I'm simply using the latest date where there is a duplicate month), and I could repeat this CTE a few time to get my results, but I'm sure there's a better way of doing this than that!
Davin
@Cade - additionally (as mad as it sounds) I don't want to actually permanently change the data that is wrong (as I need the original format to determine what's causing the problems), and so for now a solution that can simply change the query output would be better
Davin
@Davin - I would still consider making a copy of the data (even though it is 20m rows) and shifting - simple brute force attack and you're done. In addition, you may be able to take a more holistic approach to what is meant by consecutive. Obviously any number of donations cause the period to be extended. It's possible for 5 payments to be made, then none for 4 months, then monthly payments to be made again. There would be no gap. So instead of looking for consecutive, look for gaps - i.e. if total number of payments < total number of months a gap must exist, then attack with subdivision.
Cade Roux
@cade, thanks for that, didn't think of situations where that might be a problem, will have to go back and think of the best way to approach this :)
Davin