views:

232

answers:

8

I want to have a database table that keeps data with revision history (like pages on Wikipedia). I thought that a good idea would be to have two columns that identify the row: (name, version). So a sample table would look like this:

TABLE PERSONS:
    id:      int,
    name:    varchar(30),
    version: int,
    ... // some data assigned to that person.

So if users want to update person's data, they don't make an UPDATE -- instead, they create a new PERSONS row with the same name but different version value. Data shown to the user (for given name) is the one with highest version.

I have a second table, say, DOGS, that references persons in PERSONS table:

TABLE DOGS:
    id:         int,
    name:       varchar(30),
    owner_name: varchar(30),
    ...

Obviously, owner_name is a reference to PERSONS.name, but I cannot declare it as a Foreign Key (in MS SQL Server), because PERSONS.name is not unique!

Question: How, then, in MS SQL Server 2008, should I ensure database integrity (i.e., that for each DOG, there exists at least one row in PERSONS such that its PERSON.name == DOG.owner_name)?

I'm looking for the most elegant solution -- I know I could use triggers on PERSONS table, but this is not as declarative and elegant as I want it to be. Any ideas?


Additional Information

The design above has the following advantage that if I need to, I can "remember" a person's current id (or (name, version) pair) and I'm sure that data in that row will never be changed. This is important e.g. if I put this person's data as part of a document that is then printed and in 5 years someone might want to print a copy of it exactly unchanged (e.g. with the same data as today), then this will be very easy for them to do.

Maybe you can think of a completely different design that achieves the same purpose and its integrity can be enforced easier (preferably with foreign keys or other constraints)?


Edit: Thanks to Michael Gattuso's answer, I discovered another way this relationship can be described. There are two solutions, which I posted as answers. Please vote which one you like better.

A: 

I would use and association table to link the many versions to the one pk.

Preet Sangha
Could you explain this in more detail? Is your solution similar to Mark's?
DzinX
yes. A another table link resolves the many-many link
Preet Sangha
A: 

Okay, first thing is that you need to normalize your tables. Google "database normalization" and you'll come up with plenty of reading. The PERSONS table, in particular, needs attention.

Second thing is that when you're creating foreign key references, 99.999% of the time you want to reference an ID (numeric) value. I.e., [DOGS].[owner] should be a reference to [PERSONS].[id].

Edit: Adding an example schema (forgive the loose syntax). I'm assuming each dog has only a single owner. This is one way to implement Person history. All columns are not-null.

Persons Table:
int Id
varchar(30) name
...

PersonHistory Table:
int Id
int PersonId (foreign key to Persons.Id)
int Version (auto-increment)
varchar(30) name
...

Dogs Table:
int Id
int OwnerId (foreign key to Persons.Id)
varchar(30) name
...

The latest version of the data would be stored in the Persons table directly, with older data stored in the PersonHistory table.

Jon Seigel
First, as for normalization, OK, I can throw away "PERSONS.id", but that won't solve my problem at all. Second, if DOGS.owner would reference PERSONS.id, it would point to a person with fixed version, not the latest, which is exactly what I want to avoid.
DzinX
@DzinX: Please do the Google search I suggested, as by your comment it's clear you don't understand what normalization means.
Jon Seigel
Ouch, that hurt! FYI, I understand very well what normalization means. If you have an idea how to solve my problem, please explain it in more detail and maybe provide at least a part of a solution.
DzinX
@DzinX: "First, as for normalization, OK, I can throw away "PERSONS.id", ..." -- this is not normalization.
Jon Seigel
It is, as the (name, version) pair implies id and vice versa, so I can discard either id or version (because name is referenced by DOGS table). I want to keep the version for reasons explained in my post, so the only thing I can discard is id. But this doesn't help me at all! I can't see the solution you have in mind. What normalization would you apply to my tables and how would it help?
DzinX
@DzinX: Unless you've omitted something in your original post, there is no reason to believe (name, version) pairs are unique.
Jon Seigel
Quote: "I thought that a good idea would be to have two columns that identify the row: (name, version)." That aside, from what I see, your solution is similar to Mark's and suffers from the same problem: there's no guarantee that there exists at least one PersonVersion for each Dog (linked through Persons table). Can this be fixed somehow?
DzinX
@DzinX: Edited my answer to move to a different structure of how history is stored. The two solutions posted aren't good because you'll have dangling data generated by avoiding the circular foreign key reference. This would be the only way to enforce your constraint at a database level. If you can constrain the data on an application level, my previous edit will work fine. (Note: the constraint reduces to ensuring a history exists for each person, since a dog is guaranteed to have an owner by the not-null foreign key in that table.)
Jon Seigel
+1  A: 

Alternatively you could maintain a person history table for the data that has historic value. This way you keep your Persons and Dogs table tidy and the references simple but also have access to the historically interesting information.

Michael Gattuso
Yeah, I could do that, but then I should keep the current data in two copies: one in CurrentPerson, and one (with highest version) in PersonHistory, so that I don't have to switch between tables when pointing to a person from a document to be printed. This means that I have to make sure that these two entries contain the same data... that's it! Your solution gave me a good idea! Thanks! I'll post it as a separate answer.
DzinX
A: 

A project I have worked on addressed a similar problem. It was a biological records database where species names can change over time as new research improved understanding of taxonomy.

However old records needed to remain related to the original species names. It got complicated but the basic solution was to have a NAME table that just contained all unique species names, a species table that represented actual species and a NAME_VERSION table that linked the two together. At any one time there would be a preferred name (ie the currently accepted scientific name for the species) which was a boolean field held in name_version.

In your example this would translate to a Details table (detailsid, otherdetails columns) a link table called DetailsVersion (detailsid, personid) and a Person Table (personid, non-changing data). Relate dogs to Person.

andyb
Hmm from what I see, I now have four tables and still no guarantee that there exists a Details row that corresponds to a Dog row. Am I missing something?
DzinX
A: 

Persons

id (int),
name,
.....
activeVersion (this will be UID from personVersionInfo)

note: Above table will have 1 row for each person. will have original info with which person was created.

PersonVersionInfo

UID (unique identifier to identify person + version),
id (int),
name,
.....
versionId (this will be generated for each person)

Dogs

DogID,
DogName
......

PersonsWithDogs

UID,
DogID

EDIT: You will have to join PersonWithDogs, PersionVersionInfo, Dogs to get the full picture (as of today). This kind of structure will help you link a Dog to the Owner (with a specific version).

In case the Person's info changes and you wish to have latest info associated with the Dog, you will have to Update PersonWithDogs table to have the required UID (of the person) for the given Dog.

You can have restrictions such as DogID should be unique in PersonWithDogs.
And in this structure, a UID (person) can have many Dogs.

Your scenarios (what can change/restrictions etc) will help in designing the schema better.

shahkalpesh
Also a dog can have multiple owners which also happens.
HLGEM
@HLGEM: Yes and that is the reason, I said "Your scenarios (what can change/restrictions etc) will help in designing the schema better" :)
shahkalpesh
I don't like the way you have to update PersonsWithDogs each time you change Person's info. This way, I could have my 2 starting tables and have DOGS.owner_id instead of DOGS.owner_name and update this id on each person's change. This can be costly if a person has many dogs.
DzinX
A: 

Thanks to Michael Gattuso's answer, I discovered another way this relationship can be described. There are two solutions, this is the first of them. Please vote which one you like better.

Solution 1

In PERSONS table, we leave only the name (unique identifier) and a link to current person's data:

TABLE PERSONS:
    name:            varchar(30),
    current_data_id: int

We create a new table, PERSONS_DATA, that contains all data history for that person:

TABLE PERSONS_DATA:
    id:      int
    version: int (auto-generated)
    ... // some data, like address, etc.

DOGS table stays the same, it still points to a person's name (FK to PERSONS table).

ADVANTAGE: for each dog, there exists at least one PERSONS_DATA row that contains data of its owner (that's what I wanted)

DISADVANTAGE: if you want to change a person's data, you have to:

  1. add a new PERSONS_DATA row
  2. update PERSONS entry for this person to point to the new PERSONS_DATA row.
DzinX
A: 

Thanks to Michael Gattuso's answer, I discovered another way this relationship can be described. There are two solutions, this is the second of them. Please vote which one you like better.

Solution 2

In PERSONS table, we leave only the name (unique identifier) and a link to the first (not current!) person's data:

TABLE PERSONS:
    name:            varchar(30),
    first_data_id: int

We create a new table, PERSONS_DATA, that contains all data history for that person:

TABLE PERSONS_DATA:
    id:      int
    name:    varchar(30)
    version: int (auto-generated)
    ... // some data, like address, etc.

DOGS table stays the same, it still points to a person's name (FK to PERSONS table).

ADVANTAGES:

  • for each dog, there exists at least one PERSONS_DATA row that contains data of its owner (that's what I wanted)
  • if I want to change a person's data, I don't have to update the PERSONS row, only add a new PERSONS_DATA row

DISADVANTAGE: to retrieve current person's data, I have to either:

  • choose PERSONS_DATA with given name and highest version (may be expensive)
  • choose PERSONS_DATA with special version, e.g. "-1", but then I would have to update two PERSONS_DATA rows each time I add new PERSONS_DATA, and in this solution I wanted to avoid having to update 2 rows...

What do you think?

DzinX
+2  A: 

In your parent table, create a unique constraint on (id, version). Add version column to your child table, and use a check constraint to make sure that it is always 0. Use a FK constraint to map (parentid, version) to your parent table.

AlexKuznetsov
Wonderful! A little too much data in child table, but definitely worth it -- brings simplicity and order to the design. Thanks!
DzinX