views:

39

answers:

1

Problem statement: A table contains an item_id, a category_id and a date range (begin_date and end_date).

No item may be in more than one category on any given date (in general; during daily rebuilding it can be in an invalid state temporarily).

By default, all items are added (and re-added if removed) to a category (derived from outside data) automatically on a daily basis, and their membership in that category matches the lifespan of the item (items have their own begin and end date, and usually spend their entire lives in the same category, which is why this matches).

For items in category X, it is occasionally desirable to override the default category by adding them to category Y. Membership in category Y could entirely replace membership in category X (that is, the begin and end dates for membership in category Y would match the begin and end dates of the item itself), or it could override it for an arbitrary period of time (at the beginning, middle or end the item's lifespan, possibly overriding for short periods at multiple times). Membership in category Y is not renewed automatically and additions to that category is done by manual data entry.

Every day, when category X is rebuilt, we get an overlap, where any item in category Y will now be in category X as well (which is forbidden, as noted previously).

Goal: After each repopulation of category X (which is done in a rather complicated and fragile manner, and ideally would be left alone), I'm trying to find an efficient means of writing a stored procedure that:

  1. Identifies the overlaps

  2. Changes existing entries, adds new ones where necessary (such as in the case where an item starts in category X, switches to category Y, then eventually switches back to category X before ending), or removes entries (when an item is in category Y for its entire life) such that every item remains in category Y (and only Y) where specified, while category X membership is maintained when not overridden by category Y.

  3. Does not affect memberships of categories A, B, C, Z, etc., which do not have override categories and are built separately, by completely different rules.

  4. Note: It can be assumed that X membership covers the entire lifespan of the item before this procedure is called, so it is unnecessary to query any data outside this table.

  5. Bonus credit: If for some reason there are two adjacent or overlapping memberships in for the same item in category Y, stitching them together into a single entry is appreciated, but not necessary.

Example:

  • item_id category_id begin_date end_date
  • 1 X 20080101 20090628
  • 1 Y 20090101 20090131
  • 1 Y 20090601 20090628
  • 2 X 20080201 20080731
  • 2 Y 20080201 20080731

Should become:

  • item_id category_id begin_date end_date
  • 1 X 20080101 20081231
  • 1 Y 20090101 20090131
  • 1 X 20090201 20090531
  • 1 Y 20090601 20090628
  • 2 Y 20080201 20080731

If it matters, this needs to work on SQL Server 2005 and SQL Server 2008

+1  A: 

Given an item that should be in category X, once the category Y rows are already there you should be able to add the X with the code below. You'll need to adapt it to however you will be using it - in a cursor, as a set-based operation for all items with a category X row, or whatever.

DECLARE
    @y_id INT,
    @x_id INT,
    @item_id INT
SELECT @y_id = category_id FROM dbo.Categories WHERE name = 'Y'
SELECT @x_id = category_id FROM dbo.Categories WHERE name = 'X'
SET @item_id = ???

;WITH Begin_Dates AS (SELECT
    BI.item_id,
    BI.begin_date AS begin_date
FROM
    Items BI
WHERE
    NOT EXISTS (SELECT * FROM Item_Categories BIC2 WHERE BIC2.item_id = BI.item_id AND BIC2.category_id = @y_id AND BIC2.begin_date = BI.begin_date)
UNION
SELECT
    BIC1.item_id,
    end_date AS begin_date
FROM
    Item_Categories BIC1
WHERE
    BIC1.category_id = @y_id AND
    NOT EXISTS (SELECT * FROM dbo.Items WHERE item_id = BIC1.item_id AND BIC1.end_date = end_date)),
End_Dates AS (SELECT
    EI.item_id,
    EI.end_date AS end_date
FROM
    Items EI
WHERE
    NOT EXISTS (SELECT * FROM Item_Categories EIC2 WHERE EIC2.item_id = EI.item_id AND EIC2.category_id = @y_id AND EIC2.begin_date = EI.begin_date)
UNION
SELECT
    EIC1.item_id,
    begin_date AS end_date
FROM
    Item_Categories EIC1
WHERE
    EIC1.category_id = @y_id AND
    NOT EXISTS (SELECT * FROM dbo.Items WHERE item_id = EIC1.item_id AND EIC1.begin_date = begin_date))
INSERT INTO dbo.Item_Categories
(
    item_id,
    category_id,
    begin_date,
    end_date
)
SELECT
    @item_id,
    @x_id,
    BD.begin_date,
    ED.end_date
FROM
    Begin_Dates BD
INNER JOIN End_Dates ED ON
    ED.end_date > BD.begin_date AND
    NOT EXISTS (SELECT * FROM End_Dates ED2 WHERE ED2.end_date > BD.begin_date AND ED2.end_date < ED.end_date)
Tom H.
I am looking at this solution. It's got some problems with it (some of which are because you used outside information which doesn't hold true precisely in the actual DB). I'll see what I can do with it though. Thank you.
ShadowRanger
Once I figured out what you were doing (it was a little tricky since you mixed birth dates with begin dates, which was true in my simplified scenario, but wrong in the real world scenario), I was able to get it working for my specific scenario and applicable to a set of "item_ids" instead of just one. Thank you very much.
ShadowRanger
One minor correction: You used the item_categories end_date and begin_date unmodified in your WITH setup. While we can use the item begin and end dates unmodified (since it must be a member of a category at the begin and end of its life), we can't do the same for category membership dates, or we'll have a one day overlap every time we switch between the override and the default category (category membership is inclusive on both begin and end dates). Changing to "DATEADD(dy, 1, end_date) AS begin_date" and "DATEADD(dy, -1, begin_date) AS end_date" in each category selection fixed that though.
ShadowRanger
To be fair, your way would work if we were exclusive on the end_date. I have no idea what "standard" DB setup is considered to be, but inclusive on both ends seems to be most intuitive (when working with dates and ignoring times) and that was the approach used in our DB.
ShadowRanger
One last note: Sorry for the delayed response. The task got backburnered until last Tuesday, and I only figured out how to get it working that afternoon. Then I went out of town for the rest of the week and didn't get a chance to respond here until now. Again, thanks for the help!
ShadowRanger