Excuse the long question!

We have two database tables, e.g. Car and Wheel. They are related in that a wheel belongs to a car and a car has multiple wheels. The wheels, however, can be changed without affecting the "version" of the car. The car's record can be updated (e.g. paint job) without affecting the version of the wheels (i.e. no cascade updating).

For example, Car table currently looks like this:

CarId, CarVer, VersionTime, Colour
   1      1       9:00       Red
   1      2       9:30       Blue
   1      3       9:45       Yellow
   1      4      10:00       Black

The Wheels table looks like this (this car only has two wheels!)

WheelId, WheelVer, VersionTime, CarId
   1         1           9:00     1
   1         2           9:40     1
   1         3          10:05     1
   2         1           9:00     1

So, there have been four versions of this two-wheeled car. Its second wheel (WheelId 2) hasn't changed. The first wheel (WheelId 1) was changed (e.g. painted) at 9:40 and again at 10:05.

How do I efficiently run as-of queries that can be joined to other tables as required? Note that this is a new database; we own the schema and can change it or add audit tables to make this query easier. We've tried one audit table approach (with columns: CarId, CarVersion, WheelId, WheelVersion, CarVerTime, WheelVerTime), but it didn't really improve our query.

Example query: Show the Car ID 1 as it was, including its wheel records as of 9:50. This query should result in these two rows being returned:

WheelId, WheelVer, WheelVerTime, CarId, CarVer, CarVerTime, CarColour
   1         2         9:40        1       3       9:45      Yellow
   2         1         9:00        1       3       9:45      Yellow

The best query we could come up with was this:

select c.CarId, c.VersionTime, w.WheelId,w.WheelVer,w.VersionTime,w.CarId
from Cars c, 
(    select w.WheelId,w.WheelVer,w.VersionTime,w.CarId
    from Wheels w
    where w.VersionTime <= "12 Jun 2009 09:50" 
     group by w.WheelId,w.CarId
     having w.WheelVer = max(w.WheelVer)
) w
where c.CarId = w.CarId
and c.CarId = 1
and c.VersionTime <= "12 Jun 2009 09:50" 
group by c.CarId, w.WheelId,w.WheelVer,w.VersionTime,w.CarId
having c.CarVer = max(c.CarVer)

And, if you wanted to try this then the create table and insert record SQL is here:

create table Wheels
(
WheelId int not null,
WheelVer int not null,
VersionTime datetime not null,
CarId int not null,
 PRIMARY KEY  (WheelId,WheelVer)
)
go

insert into Wheels values (1,1,'12 Jun 2009 09:00', 1)
go
insert into Wheels values (1,2,'12 Jun 2009 09:40', 1)
go
insert into Wheels values (1,3,'12 Jun 2009 10:05', 1)
go
insert into Wheels values (2,1,'12 Jun 2009 09:00', 1)
go


create table Cars
(
CarId int not null,
CarVer int not null,
VersionTime datetime not null,
colour varchar(50) not null,
 PRIMARY KEY  (CarId,CarVer)
)
go

insert into Cars values (1,1,'12 Jun 2009 09:00', 'Red')
go
insert into Cars values (1,2,'12 Jun 2009 09:30',  'Blue')
go
insert into Cars values (1,3,'12 Jun 2009 09:45',  'Yellow')
go
insert into Cars values (1,4,'12 Jun 2009 10:00',  'Black')
go
+1  A: 

As-of queries are easier when each row has a start and an end time. Storing the end time in the table would be most efficient, but if this is hard, you can query it like:

select 
    ThisCar.CarId
,   StartTime = ThisCar.VersionTime
,   EndTime = NextCar.VersionTime
from Cars ThisCar
left join Cars NextCar
    on NextCar.CarId = ThisCar.CarId
    and ThisCar.VersionTime < NextCar.VersionTime
left join Cars BetweenCar
    on BetweenCar.CarId = ThisCar.CarId
    and ThisCar.VersionTime < BetweenCar.VersionTime
    and BetweenCar.VersionTime < NextCar.VersionTime
where BetweenCar.CarId is null

You can store this in a view. If the view is called vwCars, you can select a car as of a particular time like this:

select * 
from vwCars
where StartTime <= '2009-06-12 09:15' 
and ('2009-06-12 09:15' < EndTime or EndTime is null)

You could also wrap this in a table-valued function, but that might carry a steep performance penalty.
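
Since a view cannot take the as-of time as a parameter, one way to parameterise the lookup is an inline table-valued function. The following is only a sketch, assuming SQL Server; the names dbo.fnCarsAsOf and @AsOf are made up for illustration, and it filters on the single VersionTime column from the question rather than on a stored end time.

```sql
-- Sketch only: assumes SQL Server. dbo.fnCarsAsOf and @AsOf are hypothetical names.
create function dbo.fnCarsAsOf (@AsOf datetime)
returns table
as
return
(
    select c.CarId, c.CarVer, c.VersionTime, c.colour
    from Cars c
    where c.VersionTime <= @AsOf
      -- keep only the newest version at or before @AsOf
      and not exists
      (
          select 1
          from Cars later
          where later.CarId = c.CarId
            and later.VersionTime <= @AsOf
            and later.VersionTime > c.VersionTime
      )
)
go

-- With the sample data from the question this returns CarVer 3 (Yellow).
select *
from dbo.fnCarsAsOf('12 Jun 2009 09:50')
where CarId = 1
go
```

Because the function body is a single select, it can be joined to other tables like a view. Whether it performs acceptably on real data would need to be measured.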

Andomar
Your query is more efficient (fewer table scans), but doesn't perform an as-of query. Your query only gets the latest version, rather than the version as of 09:50. We might be able to take some ideas from your query though, so thanks.
ng5000
We won't be able to use views, as we need to pass the time component into the query. SPs may be an option, but since we have to join to other tables we might need to look at table functions.
ng5000
Edited with new approach for as-of dates.
Andomar
Your query isn't pulling back the results I wanted as per my question. Thanks anyway.
ng5000
+1  A: 

Depending on your application, you might want to push the versioning to secondary auditing tables that have both a start date and a nullable end date. I found that in a high-traffic OLTP system the versioning approach can become fairly expensive, and if most of your reads pull the latest version then this might be beneficial.

By using a start and end date you can query the ancillary tables looking for a date that is between start and end, or greater than start.
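
As a rough sketch of what such an audit table and its as-of lookup might look like (CarsAudit, StartTime and EndTime are illustrative names, not part of the question's schema; the rows mirror the question's Cars data with each end time taken from the next version's start):

```sql
-- Hypothetical audit table; names are illustrative only.
create table CarsAudit
(
CarId int not null,
CarVer int not null,
StartTime datetime not null,
EndTime datetime null,          -- null means "still current"
colour varchar(50) not null,
 PRIMARY KEY  (CarId,CarVer)
)
go

insert into CarsAudit values (1,1,'12 Jun 2009 09:00','12 Jun 2009 09:30','Red')
go
insert into CarsAudit values (1,2,'12 Jun 2009 09:30','12 Jun 2009 09:45','Blue')
go
insert into CarsAudit values (1,3,'12 Jun 2009 09:45','12 Jun 2009 10:00','Yellow')
go
insert into CarsAudit values (1,4,'12 Jun 2009 10:00',null,'Black')
go

-- As-of lookup: the row whose period [StartTime, EndTime) contains 09:50.
-- With the rows above this returns CarVer 3 (Yellow).
select CarId, CarVer, colour
from CarsAudit
where CarId = 1
and StartTime <= '12 Jun 2009 09:50'
and (EndTime is null or '12 Jun 2009 09:50' < EndTime)
go
```

Treating the period as half-open (start inclusive, end exclusive) means a time that falls exactly on a version boundary matches only one row.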

JoshBerke
+1  A: 

Storing the end time in the table for each situation does make the queries easier to express, but creates the problem of maintaining integrity rules such as "no two distinct situations for the same car (wheel/...) may overlap" (still reasonably doable) and "there cannot be holes in the time series of distinct situations of any single (car/wheel/...)" (more troublesome).
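
Both rules can at least be checked after the fact with self-joins. The following is a minimal sketch, assuming a hypothetical CarPeriods table with StartTime and a nullable EndTime (none of these names are in the question's schema) and half-open periods [StartTime, EndTime). The sample rows are made up so that they contain one overlap and one hole.

```sql
-- Hypothetical table for illustration only; not part of the question's schema.
create table CarPeriods
(
CarId int not null,
CarVer int not null,
StartTime datetime not null,
EndTime datetime null,          -- null means "still current"
 PRIMARY KEY  (CarId,CarVer)
)
go

-- Made-up rows: versions 1 and 2 overlap, and there is a hole between 2 and 3.
insert into CarPeriods values (1,1,'12 Jun 2009 09:00','12 Jun 2009 09:35')
go
insert into CarPeriods values (1,2,'12 Jun 2009 09:30','12 Jun 2009 09:45')
go
insert into CarPeriods values (1,3,'12 Jun 2009 09:50',null)
go

-- Rule 1: no two periods of the same car may overlap.
-- Reports each offending pair once; here it returns the pair (CarVer 1, CarVer 2).
select a.CarId, a.CarVer, b.CarVer as OverlappingCarVer
from CarPeriods a
join CarPeriods b
    on b.CarId = a.CarId
    and b.CarVer > a.CarVer
    and (b.EndTime is null or a.StartTime < b.EndTime)
    and (a.EndTime is null or b.StartTime < a.EndTime)
go

-- Rule 2: no holes. Reports closed periods whose end instant is not covered
-- by any other period of the same car; here it returns CarVer 2.
-- Note: a car whose last period was closed on purpose would also be reported.
select a.CarId, a.CarVer, a.EndTime as HoleStartsAt
from CarPeriods a
where a.EndTime is not null
and not exists
(
    select 1
    from CarPeriods b
    where b.CarId = a.CarId
    and b.StartTime <= a.EndTime
    and (b.EndTime is null or a.EndTime < b.EndTime)
)
go
```

These queries only detect violations. Preventing them at write time is the harder part described above.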

Not storing the end time in the table for each situation forces you to write self-joins each time you need to invoke an Allen operator (overlaps, merges, contains, ...) on the time intervals implied by the only time column you have.

SQL is just a nightmare if you need to do this kind of temporal stuff.

And incidentally, even just accurately formulating these queries in natural language is a nightmare. To illustrate: you said that you needed "as-of" queries, but your examples excluded the situations which were "as-of" 10:05 (WheelVer 3) and 10:00 (colour Black). This despite the fact that those situations are definitely also "as-of" 09:50.

You may be interested in a read of "Temporal Data and the Relational Model". Keep in mind that the treatment in this book is entirely abstract, since, as the book itself says, "this book is not about technology available anywhere today".

The other standard textbook on the subject (I'm told) is one by Snodgrass, but I don't know the title. I'm told the authors of these two books take completely opposite stances as to what the solution ought to be.

+3  A: 

This kind of table is known as a valid-time state table in the literature. It is universally accepted that each row should model a period by having a start date and an end date. Basically, the unit of work in SQL is the row, and a row should completely define the entity; by having just one date per row, not only do your queries become more complex, but your design is also compromised by splitting subatomic parts onto different rows.

As mentioned by Erwin Smout, one of the definitive books on the subject is:

Richard T. Snodgrass (1999). Developing Time-Oriented Database Applications in SQL

It's out of print but happily is available as a free download PDF (link above).

I have actually read it and have implemented many of the concepts. Much of the text is in ISO/ANSI Standard SQL-92 and, although some of it has been implemented in proprietary SQL syntaxes, including SQL Server (also available as downloads), I found the conceptual information much more useful.

Joe Celko also has a book, 'Thinking in Sets: Auxiliary, Temporal, and Virtual Tables in SQL', largely derived from Snodgrass's work, though I have to say where the two diverge I find Snodgrass's approaches preferable.

I concur this stuff is hard to implement in the SQL products we currently have. We think long and hard before making data temporal; if we can get away with merely 'historical' then we will. Much of the temporal functionality in SQL-92 is missing from SQL Server e.g. INTERVAL, OVERLAPS, etc. Some things as fundamental as sequenced 'primary keys' to ensure periods do not overlap cannot be implemented using CHECK constraints in SQL Server, necessitating triggers and/or UDFs.

Snodgrass's book is based on his work for SQL3, a proposed extension to Standard SQL to provide much better support for temporal databases, though sadly this seems to have been effectively shelved years ago :(

onedaywhen
+1  A: 

This query will return duplicates if you have two rows with the same exact version time for a single car ID, but that's a matter of defining what you consider to be the "latest" one in that situation. I haven't had a chance to test this yet, but I think it will give you what you need. It's at least pretty close.

SELECT
     C.car_id,
     C.car_version,
     C.colour,
     C.version_time AS car_version_time,
     W.wheel_id,
     W.wheel_version,
     W.version_time AS wheel_version_time
FROM
     Cars C
LEFT OUTER JOIN Cars C2 ON
     C2.car_id = C.car_id AND
     C2.version_time <= @as_of_time AND
     C2.version_time > C.version_time
LEFT OUTER JOIN Wheels W ON
     W.car_id = C.car_id AND
     W.version_time <= @as_of_time
LEFT OUTER JOIN Wheels W2 ON
     W2.car_id = C.car_id AND
     W2.wheel_id = W.wheel_id AND
     W2.version_time <= @as_of_time AND
     W2.version_time > W.version_time
WHERE
     C.version_time <= @as_of_time AND
     C2.car_id IS NULL AND
     W2.wheel_id IS NULL
Tom H.
A few minor changes to unify naming (e.g. car_id to CarId) and your query works.
ng5000
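
For reference, here is a sketch of what Tom H.'s query might look like once the names are aligned with the create table statements in the question. The declare/set lines for @as_of_time are an addition for illustration, and the column aliases follow the expected output in the question.

```sql
declare @as_of_time datetime
set @as_of_time = '12 Jun 2009 09:50'

select
     W.WheelId,
     W.WheelVer,
     W.VersionTime as WheelVerTime,
     C.CarId,
     C.CarVer,
     C.VersionTime as CarVerTime,
     C.colour as CarColour
from
     Cars C
-- C2 finds any later car version that is still at or before the as-of time
left outer join Cars C2 on
     C2.CarId = C.CarId and
     C2.VersionTime <= @as_of_time and
     C2.VersionTime > C.VersionTime
left outer join Wheels W on
     W.CarId = C.CarId and
     W.VersionTime <= @as_of_time
-- W2 does the same for each wheel
left outer join Wheels W2 on
     W2.CarId = C.CarId and
     W2.WheelId = W.WheelId and
     W2.VersionTime <= @as_of_time and
     W2.VersionTime > W.VersionTime
where
     C.CarId = 1 and
     C.VersionTime <= @as_of_time and
     C2.CarId is null and        -- no later car version exists
     W2.WheelId is null          -- no later wheel version exists
```

Against the sample data this should give the two rows from the question: wheel 1 at version 2 and wheel 2 at version 1, both with car version 3 (Yellow).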