ansaurus

Question

Correct normalization of database with optional columns

Answer 1

+1 A:

Cohort
    id (INT, NOT NULL, PRIMARY)
    name (TEXT, NOT NULL)
    comments (TEXT)

Parameters
    id (INT, NOT NULL, PRIMARY)
    name (TEXT, NOT NULL) ("systolic blood pressure", "triglyceride", ...)

CohortParameters
    id (INT, NOT NULL, PRIMARY)
    cohort_id (FOREIGN KEY referencing Cohort.id)
    parameter_id (FOREIGN KEY referencing Parameters.id)
    value (TEXT)

DistributionTypes
    id (INT, NOT NULL, PRIMARY)
    name (TEXT, NOT NULL) ("Triangular", "Weibull", ...)

Distributions
    id (INT, NOT NULL, PRIMARY)
    distribution_type_id (FOREIGN KEY referencing DistributionTypes.id)
    cohort_id (FOREIGN KEY referencing Cohort.id)
    parameter_id (FOREIGN KEY referencing Parameters.id)
    minimum (FLOAT)
    maximum (FLOAT)
    mean (FLOAT)
    mode (FLOAT)
    sd (FLOAT)
    ...other distribution parameters (alpha, beta, shape, scale, etc.)

chaos 2009-08-09 16:59:33

Thanks very much for your prompt response! I'm a little unclear about a couple of aspects of the solution though, specifically the CohortParameters table - what is its purpose and what purpose would the value column serve? Also, the Distributions table would still have the NULL values issue (although negligible wasted space aside, I still haven't convinced myself that this is genuinely a problem...). Thanks again for your input on this, Rich.

Rich Pollock 2009-08-09 21:17:48

Answer 2

A:

Having separate tables for different distribution types sounds right to me. In your application logic, you'll have to special-case each distribution type, anyway (I presume), as it may need different rendering in the UI, or different computations.

Martin v. Löwis 2009-08-09 17:00:00

Answer 3

A:

Your thought to have a table for each distribution type is probably what you want. That way, you have a well-defined table with each value you need specific to your distribution type. This will save you space, will allow you to lock down which fields are nullable and which are not, and will result in increased performance. If each distribution has a common set of parameters, you could arrange your tables in a supertype/subtype relationship to further normalize the schema.
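
A supertype/subtype arrangement of this kind could look roughly like the sketch below. It uses Python's built-in sqlite3 module purely so that it is runnable; the table names, column names and sample values are all hypothetical and not taken from the question. The shared columns live in one supertype table, and each distribution type gets its own subtype table keyed by the supertype's id.

```python
import sqlite3

# Hypothetical schema: names and sample values are illustrative only.
SCHEMA = """
CREATE TABLE distribution (            -- supertype: columns every type shares
    id        INTEGER PRIMARY KEY,
    cohort_id INTEGER NOT NULL,
    parameter TEXT    NOT NULL
);
CREATE TABLE triangular_distribution ( -- subtype: only triangular columns
    distribution_id INTEGER PRIMARY KEY REFERENCES distribution(id),
    minimum REAL NOT NULL,
    maximum REAL NOT NULL,
    mode    REAL NOT NULL
);
CREATE TABLE weibull_distribution (    -- subtype: only Weibull columns
    distribution_id INTEGER PRIMARY KEY REFERENCES distribution(id),
    shape REAL NOT NULL,
    scale REAL NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript(SCHEMA)

# One supertype row plus its matching subtype row.
conn.execute("INSERT INTO distribution VALUES (1, 10, 'systolic blood pressure')")
conn.execute("INSERT INTO triangular_distribution VALUES (1, 90.0, 180.0, 120.0)")

# Reading a distribution back is a join between supertype and subtype.
row = conn.execute(
    """
    SELECT d.parameter, t.minimum, t.maximum, t.mode
    FROM distribution AS d
    JOIN triangular_distribution AS t ON t.distribution_id = d.id
    WHERE d.cohort_id = ?
    """,
    (10,),
).fetchone()
print(row)
```

No subtype table carries columns that are meaningless for its distribution type, so none of them needs to be nullable.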

Dave Markle 2009-08-09 17:02:36

Answer 4

A:

How will you use the data when you query it?

If you are querying a number of cohorts, and it's reasonable for the cohorts to have different distributions, then your result would be a "union", in which many columns would indeed be null. In that case your results are in some sense "not normal", but that doesn't mean that the schema should be.

The advantage of having different tables for different distribution types is that each table explicitly defines the columns that must be populated to describe that distribution; you can then even set some columns to be "not null".
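
As a minimal sketch of what "not null" buys you, assuming a hypothetical weibull_distribution table (the name, columns and values are illustrative, and SQLite via Python's sqlite3 module is used only to make the example runnable): a row that omits a required parameter is rejected by the database itself rather than by application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical per-type table: every column a Weibull distribution needs
# is declared NOT NULL.
conn.execute(
    """
    CREATE TABLE weibull_distribution (
        id    INTEGER PRIMARY KEY,
        shape REAL NOT NULL,
        scale REAL NOT NULL
    )
    """
)

# A complete row is accepted.
conn.execute(
    "INSERT INTO weibull_distribution (shape, scale) VALUES (?, ?)", (1.5, 2.0)
)

# A row missing 'scale' violates the NOT NULL constraint and is rejected.
try:
    conn.execute("INSERT INTO weibull_distribution (shape) VALUES (?)", (1.5,))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

count = conn.execute("SELECT COUNT(*) FROM weibull_distribution").fetchone()[0]
print(rejected, count)
```

With a single wide table covering every distribution type, the same columns would have to be nullable, and this check would have to move into the application.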

I like the general idea of your proposal.

djna 2009-08-09 17:03:37

Thanks very much for the reply (and also to Martin and Dave above). I won't be routinely querying multiple cohorts, so UNION won't be involved. I'm glad you agree with the "different table for different distribution" idea, but I've hit a problem with the implementation. In my middle table (associating cohorts with the distributions), I have the dist_ID stored in a column, but I only know which table this refers to by querying the dist_type column. As such, I can't use any of InnoDB's referential integrity features like cascade deletion. Any thoughts? Maybe another question is in order...
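
One common way around this problem, sketched here as an assumption rather than as anything proposed in the thread, is to reverse the direction of the reference: each per-type table points at a shared supertype row instead of the middle table pointing at "some" per-type table. Every foreign key then targets exactly one table, so cascade deletion can be declared. The sketch uses SQLite through Python's sqlite3 module instead of InnoDB, and all table and column names are hypothetical.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not from the original design.
SCHEMA = """
CREATE TABLE cohort (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE distribution (            -- one row per cohort characteristic
    id        INTEGER PRIMARY KEY,
    cohort_id INTEGER NOT NULL REFERENCES cohort(id) ON DELETE CASCADE,
    dist_type TEXT    NOT NULL
);
CREATE TABLE triangular_distribution ( -- per-type details point at the supertype
    distribution_id INTEGER PRIMARY KEY
        REFERENCES distribution(id) ON DELETE CASCADE,
    minimum REAL NOT NULL,
    maximum REAL NOT NULL,
    mode    REAL NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript(SCHEMA)

conn.execute("INSERT INTO cohort VALUES (1, 'example cohort')")
conn.execute("INSERT INTO distribution VALUES (1, 1, 'triangular')")
conn.execute("INSERT INTO triangular_distribution VALUES (1, 90.0, 180.0, 120.0)")

# Deleting the cohort cascades to distribution, then to the per-type row.
conn.execute("DELETE FROM cohort WHERE id = 1")
remaining = conn.execute(
    "SELECT COUNT(*) FROM triangular_distribution"
).fetchone()[0]
print(remaining)
```

The dist_type column is kept here only as a lookup hint for which per-type table to join; the foreign keys do not depend on it.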

Rich Pollock 2009-08-09 21:29:42

Answer 5

+1 A:

Your design seems to indicate that there can only be one single type of distribution data per item of measured information. It seems impossible, in your design, to have both "even distribution" and "triangular distribution" data on, say, "systolic blood pressure".

This seems to indicate that for each individual piece of "measured information", you already know upfront, at system design time, what kind of distribution data is available.

This in turn seems to indicate that there is no need whatsoever (and from a relational point of view it is outright bad to do so) to gather these different kinds of distribution in a single collection, only to reinstate any necessary distinction by adding a superfluous "distribution type" column.

EDIT

"The distribution type column also becomes necessary as soon as there are two or more cohorts in the database with differently distributed physiological parameters."

That seems crap. Distinct cohorts hold distinct distribution measurement IDs, and distinct distribution measurement IDs can be of different distribution types by your very own design.

2009-08-09 21:46:00

Each piece of measured information can only have one distribution *at any one time*, but the distribution type can be changed by the user through a web interface (depending on the clinical trial data they are using, for example). The distribution type column also becomes necessary as soon as there are two or more cohorts in the database with differently distributed physiological parameters. Hope that helps.

Rich Pollock 2009-08-09 21:57:47

"Distinct cohorts hold distinct distribution measurement IDs, and distinct distribution measurement IDs can be of different distribution types by your very own design." Why does this make the requirement for a distribution type column "crap"? I still need something to link each cohort characteristic to its distribution and if each distribution type is stored in a different table, then I need some way to identify which table it's in. While it's true that it *needn't* be in the Cohort table itself, I'm pretty sure that 3NF dictates that it should be there.

Rich Pollock 2009-08-10 06:36:30
