views:

538

answers:

3

My original question can be found here, for which I've gotten some great answers, idas and tips.

As part of a feasibility and performance study, I've started to convert my schemas in order to version my data using those ideas. In doing so, I've come up with some kind of other problem.

In my original question, my example was simple, with no real relational references. In an attempt to preserve the example of my previous question, I will now extend the 'Name' part to another table.

So now, my data becomes:

Person
------------------------------------------------
ID                UINT NOT NULL,
NameID            UINT NOT NULL,
DOB               DATE NOT NULL,
Email             VARCHAR(100) NOT NULL

PersonAudit
------------------------------------------------
ID                UINT NOT NULL,
NameID            UINT NOT NULL,
DOB               DATE NOT NULL,
Email             VARCHAR(100) NOT NULL,
UserID            UINT NOT NULL,         -- Who
PersonID          UINT NOT NULL,         -- What
AffectedOn        DATE NOT NULL,         -- When
Comment           VARCHAR(500) NOT NULL  -- Why

Name
------------------------------------------------
ID                UINT NOT NULL,
FirstName         VARCHAR(200) NOT NULL,
LastName          VARCHAR(200) NOT NULL,
NickName          VARCHAR(200) NOT NULL

NameAudit
------------------------------------------------
ID                UINT NOT NULL,
FirstName         VARCHAR(200) NOT NULL,
LastName          VARCHAR(200) NOT NULL,
NickName          VARCHAR(200) NOT NULL,
UserID            UINT NOT NULL,         -- Who
NameID            UINT NOT NULL,         -- What
AffectedOn        DATE NOT NULL,         -- When
Comment           VARCHAR(500) NOT NULL  -- Why

In a GUI, we could see the following form:

ID            :  89213483
First Name    :  Firsty
Last Name     :  Lasty
Nick Name     :  Nicky
Date of Birth :  January 20th, 2005
Email Address :  [email protected]

A change can be made to:

  1. Only to the 'name' part
  2. Only to the 'person' part
  3. To both the 'name' and person parts

If '1' occurs, we copy the original record to NameAudit and update our Name record with the changes. Since the person reference to the name is still the same, no changes to Person or PersonAudit are required.

If '2' occurs, we copy the original record to PersonAudit and update the Person record with the changes. Since the name part has not changed, no changes to Name or NameAudit are required.

If '3' occurs, we update our database according to the two methods above.

If we were to make 100 changes to both the person and name parts, one problem occurs when you later try to show a history of changes. All my changes show the person having the last version of the name. Which is wrong obviously.

In order to fix this, it would seem that the NameID field in Person should reference the NameAudit instead (but only if Name has changes).

And it is this conditional logic that starts complicating things.

I would be curious to find out if anyone has had this kind of problem before with their database and what kind of solution was applied?

A: 

Keep a single changes table with an autoincrement ID and make all your changes to refer to that table.

Always put the original record to the audit table.

To build a history, select all your changes and show the value closest to the change.

Like this:

`Change`

1
2
3
4
5
6

`NameAudit`

1 - created as John Smith
5 - changed to James Smith

`PersonAudit`

1 - created as born on `01.01.1980` in `Seattle, WA`
2 - changed DOB to '01.01.1980`
3 - changed DOB to '02.01.1980`
4 - changed DOB to '02.01.1980`
6 - changes POB to `Washington, DC`

Then select:

SELECT  c.id,
        (
        SELECT   MAX(id)
        FROM     NameAudit na
        WHER     na.id <= c.id
        ) as nameVersion,
        (
        SELECT   MAX(id)
        FROM     PersonAudit pa
        WHER     pa.id <= c.id
        ) as personVersion,
        na.*,
        pa.*
FROM    change c
JOIN    NameAudit na
ON      na.id = nameVersion
JOIN    PersonAudit pa
ON      pa.id = nameVersion
WHERE   change_id BETWEEN 1 AND 6
Quassnoi
+1  A: 

Jeach,

This article has some interesting ideas.

Robert Harvey
+3  A: 

You should probably try to read about 'Temporal Database' handling. Two books you could look at are Darwen, Date and Lorentzos "Temporal Data and the Relational Model" and (at a radically different extreme) "Developing Time-Oriented Database Applications in SQL", Richard T. Snodgrass, Morgan Kaufmann Publishers, Inc., San Francisco, July, 1999, 504+xxiii pages, ISBN 1-55860-436-7. That is out of print but available as PDF on his web site at cs.arizona.edu. You can also look for "Allen's Relations" for intervals - they may be helpful to you.


I assume that the DATE type in your database includes time (so you probably use Oracle). The SQL standard type would probably be TIMESTAMP with some number of fractional digits for sub-second resolution. If your DBMS does not include time with DATE, then you face a difficult problem deciding how to handle multiple changes in a single day.

What you need to show, presumably, is a history of the changes in either table, with the corresponding values from the other table that were in force at the time when the changes were made. You also need to decide whether what you are showing is the before or after image; presumably, again, the after image. That means that you will have a 'sequenced' query (Snodgrass's term), with columns like:

Start time        -- When this set of values became valid
End time          -- When this set of values became invalid
PersonID          -- Person.ID (or PersonAudit.ID) for Person data
NameID            -- Name.ID (or NameAudit.ID) for Name data
DOB               -- Date of Birth recorded while data was valid
Email             -- Email address recorded while data was valid
FirstName         -- FirstName recorded while data was valid
LastName          -- LastName recorded while data was valid
NickName          -- NickName recorded while data was valid

I assume that once a Person.ID is established, it does not change; ditto for Name.ID. That means that they remain valid while the records do.

One of the hard parts in this is establishing the correct set of 'start time' and 'end time' values, since transitions could occur in either table (or both). I'm not even sure, at the moment, that you have all the data you need. When a new record is inserted, you don't capture the time it becomes valid (there is nothing in the XYZAudit table when you insert a new record, is there?).


There's a lot more could be said. Before going further, though, I'd rather have some feedback about some of the issues raised so far.


Some other SO questions that might help:

Jonathan Leffler
Thanks for the references, I look forward to exploring those!As for the DATE type, I'm currently using MySQL for my tests. I have written my SQL examples in some sort of pseudocode (due to my limited database knowledge). When using DATE, I meant it to mean date/time. Sorry about the informal SQL syntax. Same goes with UINT, it was meant as an unsigned integer.
Jeach
<i>When a new record is inserted, you don't capture the time it becomes valid (there is nothing in the XYZAudit table when you insert a new record, is there?).<i><p>Yes, I use the 'AffectedOn' to track the modification dates.
Jeach
I'm kinda starting to understand the idea behind 'temporal databases'. In your answer above, you have a 'start time' and 'end time'. Can you give me an added benefit of maintaining these values? I could also use a boolean 'isActive' variable in order to indicate which records are valid and invalid.
Jeach
@Jeach: so, do you insert into the audit table on all non-retrieval DML operations? If so, then you only need to look at the audit table to get the history - and the non-audit table to get the current version (but that's an optimization since the current version of the info is also in the audit table).
Jonathan Leffler
The advantage of a start time and end time in a single record is that you tell the range of validity of the record without looking at any other record. As soon as you drop one of those two, you have to look at the prior or next record to find the data, and that - believe me - makes the SQL harder. (It ain't easy to start with; you don't need it to be any harder.)
Jonathan Leffler
I do like your proposal of 'start time' and 'end time', since this would also allow me to perform queries like: "Give me all the updates performed in 2008". But how would one go about obtaining the CURRENT record? "Give me the record where EndTime = INFINITY"? Is that the only difference between the old records and the current record?
Jeach
Jonathan Leffler
Jonathan Leffler
It's probably time to start transferring some of these comments into the 'answer'. It's also time to get on with some real work, boo hiss! :( :D
Jonathan Leffler