views:

267

answers:

5

I am intentionally leaving this quite vague at first. I'm looking for discussion and what issues are important more than I'm looking for hard answers.

I'm in the middle of designing an app that does something like portfolio management. The design I have so far is

  • Problem: a problem that needs to be solved
  • Solution: a proposed solution to one or more problems
  • Relationship: a relationship among two problems, two solutions, or a problem and a solution. Further broken down into:
    • Parent-child - some sort of categorization / tree hierarchy
    • Overlap - the degree to which two solutions or two problems really address the same concept
    • Addresses - the degree to which a problem addresses a solution

My question is about the temporal nature of these things. Problems crop up, then fade. Solutions have an expected resolution date, but that might be modified as they are developed. The degree of a relationship might change over time as problems and solutions evolve.

So, the question: what is the best design for versioning of these things so I can get both a current and an historical perspective of my portfolio?

Later: perhaps I should make this a more specific question, though @Eric Beard's answer is worth an up.

I've considered three database designs. I'll enough of each to show their drawbacks. My question is: which to pick, or can you think of something better?

1: Problems (and separately, Solutions) are self-referential in versioning.

table problems
  int id | string name | text description | datetime created_at | int previous_version_id

  foreign key previous_version_id -> problems.id

This is problematic because every time I want a new version, I have to duplicate the entire row, including that long description column.

2: Create a new Relationship type: Version.

table problems
  int id | string name | text description | datetime created_at

This simply moves the relationship from the Problems and Solutions tables into the Relationships table. Same duplication problem, but perhaps a little "cleaner" since I already have an abstract Relationship concept.

3: Use a more Subversion-like structure; move all Problem and Solution attributes into a separate table and version them.

table problems
  int id

table attributes
  int id | int thing_id | string thing_type | string name | string value | datetime created_at | int previous_version_id

  foreign key (thing_id, thing_type) -> problems.id or solutions.id
  foreign key previous_version_id -> attributes.id

This means that to load the current version of a Problem or Solution I have to fetch all versions of the attribute, sort them by date and then use the most current. That might not be terrible. What seems really bad to me is that I can't type-check these attributes in the database. That value column has to be free-text. I can make the name column a reference into a separate attribute_names table that has a type column, but that doesn't force the correct type in the attributes table.

later still: response to @Eric Beard's comments about multi-table foreign keys:

Alas, what I've described is simplistic: there are only two types of Things (Problems and Solutions). I actually have about 9 or 10 different types of Things, so I'd have 9 or 10 columns of foreign keys under your strategy. I wanted to use single-table inheritance, but the Things have so little in common that it would be extremely wasteful to do combine them into one table.

+1  A: 

Hmm, sounds kind of like this site...

As far as a database design would go, a versioning system kind of like SVN, where you never actually do any updates, just inserts (with a version number) when things change, might be what you need. This is called MVCC, Multi-Value Concurrency Control. A wiki is another good example of this.

Eric Z Beard
+1  A: 

@Gaius

foreign key (thing_id, thing_type) -> problems.id or solutions.id

Be careful with these kinds of "multidirectional" foreign keys. My experience has shown that query performance suffers dramatically when your join condition has to check the type before figuring out which table to join on. It doesn't seem as elegant but nullable

problem_id and solution_id

will work much better.

Of course, query performance will also suffer with an MVCC design when you have to add the check to get the latest version of a record. The tradeoff is that you never have to worry about contention with updates.

Eric Z Beard
Voted up as those foreign keys that go to various tables so confuse the optimizer. It's also bad when you delete a parent record and for RI.
WW
A: 

I suppose there's

Option 4: the hybrid

Move the common Thing attributes into a single-inheritance table, then add an custom_attributes table. This makes foreign-keys simpler, reduces duplication, and allows flexibility. It doesn't solve the problems of type-safety for the additional attributes. It also adds a little complexity since there are two ways for a Thing to have an attribute now.

If description and other large fields stay in the Things table, though, it also doesn't solve the duplication-space problem.

table things
  int id | int type | string name | text description | datetime created_at | other common fields...
  foreign key type -> thing_types.id

table custom_attributes
  int id | int thing_id | string name | string value
  foreign key thing_id -> things.id
James A. Rosen
+1  A: 

How do you think about this:

table problems
int id | string name | text description | datetime created_at

table problems_revisions
int revision | int id | string name | text description | datetime created_at
foreign key id -> problems.id

Before updates you have to perform an additional insert in the revision table. This additional insert is fast, however, this is what you have to pay for

  1. efficient access to the current version - select problems as usual
  2. a schema that is intuitive and close to the reality you want to model
  3. joins between tables in your schema keep efficient
  4. using a revision number per busines transaction you can do versioning over table records like SVN does over files.
A: 

It's a good idea to choose a data structure that makes common questions that you ask of the model easy to answer. It's most likely that you're interested in the current position most of the time. On occasion, you will want to drill into the history for particular problems and solutions.

I would have tables for problem, solution, and relationship that represent the current position. There would also be a problem_history, solution_history, etc table. These would be child tables of problem but also contain extra columns for VersionNumber and EffectiveDate. The key would be (ProblemId, VersionNumber).

When you update a problem, you would write the old values into the problem_history table. Point in time queries are therefore possible as you can pick out the problem_history record that is valid as-at a particular date.

Where I've done this before, I have also created a view to UNION problem and problem_history as this is sometimes useful in various queries.

Option 1 makes it difficult to query the current situation, as all your historic data is mixed in with your current data.

Option 3 is going to be bad for query performance and nasty to code against as you'll be accessing lots of rows for what should just be a simple query.

WW