An application I inherited tracks lab test results performed on material samples. Data is stored in a single table (tblSampleData) with a primary key of SampleID and 235 columns representing potential test results. The problem is that only a few tests are performed per sample, so each row contains over 200 nulls. Actually, there is a second similar table (tblSampleData2) with another 215 primarily null columns and a primary key of SampleID. The two tables have a one-to-one relationship and most SampleIDs have some data in both tables. For every SampleID, however, there are easily 400 null columns!

Is this bad database design? If so, which normal form rule is broken? How can I query this table to identify which groups of columns are typically filled together with data? My goal would be to have, say, 45 tables of 10 columns each, with far fewer null values. How can I do this? How do I avoid breaking existing applications?

The tables have about 200,000 sample records so far. Users are asking me to add more columns for more tests, but I'd rather build a new table. Is this wise?

+4  A: 

You could use the well-known Entity-Attribute-Value (EAV) model. The description of when it is appropriate to use EAV fits your use case quite well:

This data representation is analogous to space-efficient methods of storing a sparse matrix, where only non-empty values are stored.

One example of EAV modeling in production databases is seen with the clinical findings (past history, present complaints, physical examination, lab tests, special investigations, diagnoses) that can apply to a patient. Across all specialties of medicine, these can range in the hundreds of thousands (with new tests being developed every month). The majority of individuals who visit a doctor, however, have relatively few findings.

In your specific case:

  • The entity is a material sample.
  • The attribute is a test type.
  • The value is the result of a test for a specific sample.
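
A minimal sketch of what that could look like as tables (the table and column names here are illustrative, not taken from your existing schema):

-- One row per test type ("attribute")
CREATE TABLE TestType (
    TestTypeID INT PRIMARY KEY,
    TestName   VARCHAR(100) NOT NULL
);

-- One row per (sample, test) pair that actually has a result
CREATE TABLE SampleTestResult (
    SampleID    INT NOT NULL,          -- the entity: a material sample
    TestTypeID  INT NOT NULL,          -- the attribute: which test was run
    ResultValue VARCHAR(50) NOT NULL,  -- the value: the result of that test
    PRIMARY KEY (SampleID, TestTypeID),
    FOREIGN KEY (TestTypeID) REFERENCES TestType (TestTypeID)
);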

EAV has some serious drawbacks and creates a number of difficulties so it should only be applied when it is appropriate to do so. You should not use it if you need to return all test results for a specific sample in a single row.

It will be difficult to modify the database to use this structure without breaking existing applications.

Mark Byers
+1. If you have hundreds of columns with mostly nulls, then you're doing it wrong.
tomdemuyt
Migrating to EAV will make it easier for users to define new attributes, without the database having to be modified each time.
pascal
+1  A: 

Just because no normal-form rules are broken doesn't mean it isn't bad database design. Generally you're better off with smaller, more tightly packed rows, because more rows fit in a page and the database has less work to do. With the current design the server has to devote a lot of space to holding null values.

Avoiding breaking existing applications is the tough part. If the other applications only need read access, you could write a view that looks identical to the old table.

Nathan Hughes
+8  A: 

I have seen articles and papers that indicate that simply having NULLs in the database breaks first normal form.

From what I've gathered from your description of the database, a better design might be as follows:

A Sample table with fields that are always associated with a sample. For example,

Sample
------ 
SampleID 
SampleDate 
SampleSource

Then, a table of test types with one entry for each type of test that can be performed.

TestType
--------
TestTypeID
TestName
MaximumAllowedValue

Finally, have an intermediate table that represents the many-to-many relationship between the two above tables and holds the results for the tests.

TestResult
----------
SampleID
TestTypeID
TestResult

This would eliminate the null values because the TestResult table would only contain entries for the tests that were actually performed on each sample. I once designed a database for an almost identical purpose to what I believe you are doing and this is the approach I took.
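
As a rough illustration of how this structure would be queried (names follow the sketches above; the SampleID value is only an example):

SELECT s.SampleID, s.SampleDate, t.TestName, r.TestResult
FROM   Sample s
JOIN   TestResult r ON r.SampleID   = s.SampleID
JOIN   TestType   t ON t.TestTypeID = r.TestTypeID
WHERE  s.SampleID = 12345;

This returns one row per test actually performed on the sample, rather than one wide row that is mostly NULL.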

NYSystemsAnalyst
Carl Manaster
I like this answer, but I want to be sure I understand it. Will my current 450 columns become 450 TestType rows with TestNames matching the original table column names? I like that, because I wouldn't need to create a new table every time some new tests need to be added. Does this make sense: I could include a TestGroup table to identify classes or categories of similar tests? The TestType table would contain a TestGroupID foreign key. The TestGroupNames would represent what I previously thought should be separate table names.
DeveloperDan
Correct, the 450 columns would become 450 rows in the TestType table. Then, for each sample, you would just take the tests that were actually performed to make entries in the TestResult table. This would definitely make the database easier to maintain as new tests are added. Yes, you could certainly include a TestGroup table as you described. This would make it much easier to group the tests for display, such as on reports. As Carl mentioned in his comment, make sure you set your keys and constraints correctly to avoid duplicate test result entries.
NYSystemsAnalyst
How does this structure handle different TestResult data types? Currently test results are floats, ints and varchars, yet the above table contains only one TestResult column with no data type specified.
DeveloperDan
Without knowing the actual data you are working with, it's difficult to make a recommendation for this. One possibility would be to have three columns in the TestResult table, one for float, one for int, and one for varchar. In the TestType table, have a field called TestDataType that has a value of 1, 2, or 3. Then, when inserting into the TestResult table, use the appropriate column based on the value of TestDataType. That is just one idea.
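
For instance, that idea could be sketched roughly like this (column names are only illustrative):

CREATE TABLE TestResult (
    SampleID     INT NOT NULL,
    TestTypeID   INT NOT NULL,
    FloatValue   FLOAT NULL,          -- used when TestType.TestDataType = 1
    IntValue     INT NULL,            -- used when TestType.TestDataType = 2
    VarcharValue VARCHAR(255) NULL,   -- used when TestType.TestDataType = 3
    PRIMARY KEY (SampleID, TestTypeID)
);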
NYSystemsAnalyst
+1  A: 

If you do change your table structure, I'd recommend having a view called tblSampleData which returns the same data as the table does now. That'll preserve some compatibility.
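
A rough sketch of what such a view could look like, assuming the Sample/TestType/TestResult structure from another answer, the original table renamed out of the way first, and only two of the original columns shown:

CREATE VIEW tblSampleData AS
SELECT s.SampleID,
       MAX(CASE WHEN t.TestName = 'MeltTemp' THEN r.TestResult END) AS MeltTemp,
       MAX(CASE WHEN t.TestName = 'BurnTemp' THEN r.TestResult END) AS BurnTemp
       -- ...one CASE expression per original column...
FROM   Sample s
LEFT JOIN TestResult r ON r.SampleID   = s.SampleID
LEFT JOIN TestType   t ON t.TestTypeID = r.TestTypeID
GROUP BY s.SampleID;

Existing read-only code can then keep selecting from tblSampleData unchanged.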

Jonathan
It may be sensible to refactor the application anyway, but this will prevent the application from breaking initially.
Kenny Evitt
+1  A: 

I'm not sure the design is really that bad. NULL values should actually be relatively cheap to store. In SQL Server, each row carries an internal NULL bitmap that indicates which column values are NULL.

If the performance of the application doesn't need to be improved and the cost-benefit of refactoring due to changing the table schema isn't positive, why change it?

Kenny Evitt
He seems to indicate that he needs to add tests from time to time, which involves repeatedly changing the table schema AND any related queries or procedures. There is definitely a cost for that too.
NYSystemsAnalyst
A: 

I'd go with one main table, with one row per sample; it would contain all the columns that every sample should have:

Sample
-------
SampleID  int auto increment PK
SampleComment
SampleDate
SampleOrigin
....

I'd then add one table for each different test or "class" of similar tests, and include all columns related to those (use the actual test name and not XYZ):

TestMethod_XYZ
---------------
SampleID    int FK Sample.SampleID
MeltTemp
BurnTemp
TestPersonID
DateTested
...

TestMethod_ABC
---------------
SampleID    int FK Sample.SampleID
MinImpactForce
TestPersonID
DateTested
....

TestMethod_MNO
---------------
SampleID    int FK Sample.SampleID
ReactionYN
TimeToReact
ReactionType
TestPersonID
DateTested
...

When you search for a result, you would search the test method table that applies and join back to the actual sample table.
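
For example, searching for samples by one of the XYZ results and joining back to the sample table might look like this (names come from the sketches above; the threshold is arbitrary):

SELECT s.SampleID, s.SampleDate, x.MeltTemp, x.BurnTemp
FROM   TestMethod_XYZ x
JOIN   Sample s ON s.SampleID = x.SampleID
WHERE  x.MeltTemp > 1200;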

KM
You've elaborated my original idea. However, I hoped to come up with a clever query to determine classes of tests. That is, based on the existing data, which tables should each column likely be broken out into? Of course, I could just ask people in the lab to categorize their tests for me, but where's the fun in that?
DeveloperDan
A: 

Let's say you have test machine X with 40 measurement channels. If you know that on each test the testers will use just a few channels, you could change the design to:

tblTest:   testId, testDate
tblResult: testId, machineId, channelId, Result

You could always retrieve the previous layout using a crosstab.
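
For example, if this is SQL Server, one possible crosstab over a few channels (the channel ids are hypothetical):

SELECT testId,
       [1] AS Channel1, [2] AS Channel2, [3] AS Channel3
FROM   (SELECT testId, channelId, Result FROM tblResult) AS src
PIVOT  (MAX(Result) FOR channelId IN ([1], [2], [3])) AS p;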

iDevlop
A: 

EAV is an option but the queries will kill you.

Is it an option to migrate the data to a NoSQL DB like MongoDB? I believe this would be the easiest and most efficient way around your problem. Since you mentioned you're basically doing CRUD queries, NoSQL should be pretty efficient.

Alix Axel
Migration is not likely. I've never heard of MongoDB. I'll google it and NoSQL.
DeveloperDan
It's a schema-less database, from the description of your problem it seems to fit perfectly: http://www.mongodb.org/.
Alix Axel
A: 

The current design is a poor one. In general, a database with lots of NULL values is an indication of poor design, violating 4th normal form. But the biggest problem with the design is not a violation of normalization principles but the fact that adding a new test type requires changes to the database structure rather than simply adding some data to several tables that "define" a test. Even worse, it requires structural changes to an existing table rather than the addition of new tables.

You can achieve perfect fourth normal form by adopting a key-value system as described by others. But you may be able to substantially improve the design of the database, and still maintain your sanity (something hard to do when working with key-value systems without an ORM), by doing either of the following:

  1. Attempt to discover the largest number of measurements required to represent any individual test. If there are different data types returned by the tests, you'll need to discover the largest number of values of each data type returned by the largest test. Create a table with those columns only, labelled Meas1, Meas2, etc. Instead of 400 columns you'll need, perhaps, 10. Or 40. Then create a set of tables that describe what each column "means" for each test. This information can be used to provide meaningful prompts and report column headers depending on the type of test being stored. This will not eliminate NULLs entirely, but it will greatly reduce them and, as long as any new test can be "fit" into the number of measurements you specified, new tests can be added as data rather than as structural changes (a rough sketch of this idea follows the list).

  2. Discover the actual list of measurements for each test and create a separate table to hold the results of each one (basic information like test ID, who ran it, the time, etc. still goes into a single table). This is a multi-table inheritance pattern (I don't know if it has a real name). You still have to create a new "data" table for each new test, but now you won't be touching other existing production tables and you will be able to achieve perfect normal form.
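
A rough sketch of option 1, with purely illustrative names:

CREATE TABLE TestRun (
    TestRunID  INT PRIMARY KEY,
    SampleID   INT NOT NULL,
    TestTypeID INT NOT NULL,
    Meas1      FLOAT NULL,
    Meas2      FLOAT NULL,
    Meas3      FLOAT NULL
    -- ...only as many Meas columns as the largest test requires
);

-- Describes what each Meas column "means" for a given test type
CREATE TABLE TestTypeMeasurement (
    TestTypeID INT NOT NULL,
    MeasNumber INT NOT NULL,            -- 1 = Meas1, 2 = Meas2, ...
    MeasLabel  VARCHAR(100) NOT NULL,   -- prompt / report column header
    PRIMARY KEY (TestTypeID, MeasNumber)
);

Adding a new test is then a matter of inserting rows into TestTypeMeasurement rather than altering an existing table.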

I hope this provides some ideas to get started.

Larry Lustig
We don't use an ORM. Can you elaborate on why maintaining the key-value system is difficult? I'll consider your idea if it's easier to maintain, but I don't quite understand the structure. I'll try searching for "multi-table inheritance pattern", or perhaps someone could suggest a link or show a sample table structure?
DeveloperDan
+1  A: 
  1. You probably don't even need an RDBMS for this data. Store your data in structured binary files or a DBM/ISAM table.

  2. It's not normalized. Usually, lack of normalization is the source of all your problems. But in this case, lack of normalization is not the end of the world because this data is "read only", there's only one key, and it's not related to anything else. So update anomalies shouldn't be a worry. You only have to worry that the original data is consistent.

  3. There's nothing too terribly wrong with all those NULLs if you treat NULLs as a "special value" with the same meaning across the entire app. Data was not collected. Data not available. Subject refused to answer question. Data is outlier. Data is pending. Data is known to be UNKNOWN. Subject said they didn't know... etc. you get the idea. Allowing NULLs for no defined reason with no defined meaning is terribly wrong.

  4. I say normalize it. Either define special values and create one massive table. Or leave NULLs for the VB and PHP programmers, and split your data up properly. Create a VIEW to join the data back up if you need to support legacy code. From what you described, you're talking about a couple of hours of work to get this thing correct. That's not such a bad deal.

John