I have some rather dodgy data I'm looking at, and I've been tasked with finding out for how many consecutive months something has been going on for. However I have noitced that in some of my reference tables individual date values are being miscoded, and where there are multiple values in one month, the latest should be added into the next month.
For example:
Contact _id | Original Payment Date
id001 | 02/07/2003
id001 | 30/07/2003 --should be changed to 30/08/2003
id001 | 01/09/2003
id001 | 01/10/2003
id001 | 30/10/2003 -- should be changed to 30/11/2003
id001 | 02/12/2003
id001 | 31/12/2003 -- should be changed to 31/01/2004
id001 | 30/01/2004
id001 | 03/03/2004
However there are two problems I'm finding with a simply DATEADD function:
1) If a payment needs to be moved into a following month and has a date greater than the next month allows (i.e. 31/01/2003 cannot go to 31/02/2003) I'm not sure how the dateadd would work in this case
2) If in the example above i make those changes, we have the following data:
id001 | 02/07/2003
id001 | 30/08/2003
id001 | 01/09/2003
id001 | 01/10/2003
id001 | 30/11/2003
id001 | 02/12/2003
id001 | 31/01/2004 --This should now be changed to a value in Februrary 2004 as there are
now duplicates in January 2004 created by the previous amendment
id001 | 30/01/2004
id001 | 03/03/2004
Although I believe that 2 'loops' of changes would make sure that all data is correct I cannot be sure, and so I would really like some way of pushing the latest date in a month (where there are 2 values for that month) forward a month that repeats until there are no more duplicates if possible.
I am using sql server 2005 and the reference table has around 20 million rows, so I'd prefer not to use cursors if possible :)
Thanks!
update The script I am using to update the dates the first time is this :
;WITH cte1 AS (
SELECT contact_id
,value_net
,DATEPART(YEAR, date_received)*12 + DATEPART(MONTH, date_received) -
DENSE_RANK() OVER
(PARTITION BY contact_id
ORDER BY DATEPART(YEAR, date_received)*12 + DATEPART(MONTH, date_received)) AS dategroup
,DENSE_RANK() OVER
(PARTITION BY contact_id
ORDER BY DATEPART(YEAR, date_received)*12 + DATEPART(MONTH, date_received)) AS rnk
,ROW_NUMBER() OVER
(PARTITION BY contact_id
ORDER BY DATEPART(YEAR, date_received)*12 + DATEPART(MONTH, date_received)) AS rnk2
,date_received
FROM donation with (nolock)
WHERE contact_id IS NOT NULL
)
,cte2 AS
(
SELECT
c1.contact_id
,c1.value_net
,c1.dategroup
,CASE WHEN c1.rnk = c2.rnk AND c1.rnk2 > c2.rnk2 THEN DATEADD(MM,+1,c1.date_received) ELSE c1.date_received END as date_received
from cte1 c1
LEFT OUTER JOIN cte1 c2 WITH (nolock) ON c2.contact_id = c1.contact_id AND c2.rnk = c1.rnk AND c2.rnk2 = c1.rnk2-1
)