I need to create a database table that stores parametric descriptions of physiological characteristics (e.g. systolic blood pressure, triglyceride concentrations, etc.) of a hypothetical cohort of patients.
For example, supposing the user specifies a triangular distribution for SBP, then the minimum, maximum and mode (and distribution type) would have to be stored. Alternatively, the user may specify a normal distribution, requiring storage of the mean and standard deviation.
I'm struggling with the correct way to normalize these data. Currently, I have a Cohort table and a Distribution table with a number of one-to-one relationships as follows (some fields omitted):
Cohort id (INT, NOT NULL, PRIMARY) name (TEXT, NOT NULL) comments (TEXT) systolic_blood_pressure_dist (FOREIGN KEY referencing Distributions.id) triglyceride_dist (FOREIGN KEY referencing Distributions.id) ...other physiological parameters Distributions id (INT, NOT NULL, PRIMARY) distribution_type (TEXT) minimum (FLOAT) maximum (FLOAT) mean (FLOAT) mode (FLOAT) sd (FLOAT) ...other distribution parameters (alpha, beta, shape, scale, etc.)
(distribution_type holds a string describing the distribution: "Triangular", "Weibull", etc.)
I'm pretty sure this is not the optimum way to do this as I'm left with loads of NULL fields in each row of Distributions.
My other thought was to have separate tables for each distribution type (one for triangular, one for Gaussian, one for uniform, etc.) and have a table in the middle with an id column (to be used as a foreign key in the Cohort table *_dist columns), a distribution type column and an id column to store the foreign key for the row in the appropriate distribution table.
The query would use the id stored in the Cohort column to find the distribution type and row id from the middle table, then lookup the parameters in the appropriate table using the id. However, using a string to select the appropriate table, then another id to select the appropriate row is far from a traditional JOIN and also doesn't feel like a very clean approach.
So, does anyone have any suggestions regarding how to best achieve this (in terms of normalization and/or performance)?
Many thanks, Rich