views:

272

answers:

5

I need to create a database table that stores parametric descriptions of physiological characteristics (e.g. systolic blood pressure, triglyceride concentrations, etc.) of a hypothetical cohort of patients.

For example, supposing the user specifies a triangular distribution for SBP, then the minimum, maximum and mode (and distribution type) would have to be stored. Alternatively, the user may specify a normal distribution, requiring storage of the mean and standard deviation.

I'm struggling with the correct way to normalize these data. Currently, I have a Cohort table and a Distribution table with a number of one-to-one relationships as follows (some fields omitted):

    Cohort
        id (INT, NOT NULL, PRIMARY)
        name (TEXT, NOT NULL)
        comments (TEXT)
        systolic_blood_pressure_dist (FOREIGN KEY referencing Distributions.id)
        triglyceride_dist (FOREIGN KEY referencing Distributions.id)
        ...other physiological parameters

    Distributions
        id (INT, NOT NULL, PRIMARY)
        distribution_type (TEXT)
        minimum (FLOAT)
        maximum (FLOAT)
        mean (FLOAT)
        mode (FLOAT)
        sd (FLOAT)
        ...other distribution parameters (alpha, beta, shape, scale, etc.)

(distribution_type holds a string describing the distribution: "Triangular", "Weibull", etc.)

I'm pretty sure this is not the optimum way to do this as I'm left with loads of NULL fields in each row of Distributions.

My other thought was to have separate tables for each distribution type (one for triangular, one for Gaussian, one for uniform, etc.) and have a table in the middle with an id column (to be used as a foreign key in the Cohort table *_dist columns), a distribution type column and an id column to store the foreign key for the row in the appropriate distribution table.

The query would use the id stored in the Cohort column to find the distribution type and row id from the middle table, then lookup the parameters in the appropriate table using the id. However, using a string to select the appropriate table, then another id to select the appropriate row is far from a traditional JOIN and also doesn't feel like a very clean approach.

So, does anyone have any suggestions regarding how to best achieve this (in terms of normalization and/or performance)?

Many thanks, Rich

+1  A: 
Cohort
    id (INT, NOT NULL, PRIMARY)
    name (TEXT, NOT NULL)
    comments (TEXT)

Parameters
    id (INT, NOT NULL, PRIMARY)
    name (TEXT, NOT NULL) ("systolic blood pressure", "trygliceride", ...)

CohortParameters
    id (INT, NOT NULL, PRIMARY)
    cohort_id (FOREIGN KEY referencing Cohort.id)
    parameter_id (FOREIGN KEY referencing Parameters.id)
    value (TEXT)

DistributionTypes
    id (INT, NOT NULL, PRIMARY)
    name (TEXT, NOT NULL) ("Triangular", "Weibull", ...)

Distributions
    id (INT, NOT NULL, PRIMARY)
    distribution_type_id (FOREIGN KEY referencing DistributionTypes.id)
    cohort_id (FOREIGN KEY referencing Cohort.id)
    parameter_id (FOREIGN KEY referencing Parameter.id)
    minimum (FLOAT)
    maximum (FLOAT)
    mean (FLOAT)
    mode (FLOAT)
    sd (FLOAT)
    ...other distribution parameters (alpha, beta, shape, scale, etc.)
chaos
Thanks very much for your prompt response! I'm a little unclear about a couple of aspects of the solution though, specifically the CohortParameters table - what is its purpose and what purpose would the value column serve? Also, the Distributions table would still have the NULL values issue (although negligible wasted space aside, I still haven't convinced myself that this is genuinely a problem...). Thanks again for your input on this, Rich.
Rich Pollock
A: 

Having separate tables for different distribution types sounds right to me. In your application logic, you'll have to special-case each distribution type, anyway (I presume), as it may need different rendering in the UI, or different computations.

Martin v. Löwis
A: 

Your thought to have a table for each distribution type is probably what you want. That way, you have a well-defined table with each value you need specific to your distribution type. This will save you space, will allow you to lock down which fields are nullable and which are not, and will result in increased performance. If each distribution has a common set of parameters, you could arrange your tables in a supertype/subtype relationship to further normalize the schema.

Dave Markle
A: 

How will you use the data when you query it?

If you are querying a number of cohorts, and it's reasonable for the cohorts to have different distributions then your result would be a "union", where indeed many columns would be null. In which case your results are in some sense "not normal", but that doesn't mean that the schema should be.

The advantage of having different tables for different distributions types is that each table would explicit define the columns that must be populated to describe that distribution, you can even then set some columns to be "not null".

I like the general idea of your proposal.

djna
Thanks very much for the reply (and also to Martin and Dave above). I won't be routinely querying multiple cohorts, so UNION won't be involved. I'm glad you agree with the "different table for different distribution" idea, but I've hit a problem with the implementation.In my middle table (associating cohorts with the distributions), I have the dist_ID stored in a column, but I only know which table this refers to by querying the dist_type column. As such, I can't use any of InnoDB's referential integrity features like cascade deletion. Any thoughts? Maybe another question is in order...
Rich Pollock
+1  A: 

Your design seems to indicate that there can only be one single type of distribution data per item of measured information. It seems impossible, in your design, to have both "even distribution" and "triangular distribution" data on, say, "systolic blood pressure".

This seems to indicate that for each individual piece of "measured information", you already know upfront, at system design time, what kind of distribution data is available.

This in turn seems to indicate that there is no need what so ever (and from a relational point of view it is outright bad to do so) to gather these different kinds of distribution in a single collection, only to reinstate any necessary distinction by adding a superfluous "distribution type" column.

EDIT

"The distribution type column also becomes necessary as soon as there are two or more cohorts in the database with differently distributed physiological parameters."

That seems crap. Distinct cohorts hold distinct distribution measurement IDs, and distinct distribution measurement IDs can be of different distribution types by your very own design.

Each piece of measured information can only have one distribution *at any one time*, but the distribution type can be changed by the user through a web interface (depending on the clinical trial data they are using, for example). The distribution type column also becomes necessary as soon as there are two or more cohorts in the database with differently distributed physiological parameters. Hope that helps.
Rich Pollock
"Distinct cohorts hold distinct distribution measurement IDs, and distinct distribution measurement IDs can be of different distribution types by your very own design." Why does this make the requirement for a distribution type column "crap"? I still need something to link each cohort characteristic to its distribution and if each distribution type is stored in a different table, then I need some way to identify which table it's in. While it's true that it *needn't* be in the Cohort table itself, I'm pretty sure that 3NF dictates that it should be there.
Rich Pollock