views:

59

answers:

4

Some while ago, I've been reading through the book SQL and Relational Theory by C. J. Date. The author is well-known for criticising SQL's three-valued logic (3VL).1)

The author makes some strong points about why 3VL should be avoided in SQL, however he doesn't outline how a database model would look like if nullable columns weren't allowed. I've thought on this for a bit and have come up with the following solutions. If I missed other design options, I would like to hear about them!

1) Date's critique of SQL's 3VL has in turn been criticized too: see this paper by Claude Rubinson (includes the original critique by C. J. Date).


Example table:

As an example, take the following table where we have one nullable column (DateOfBirth):

#  +-------------------------------------------+
#  |                   People                  |
#  +------------+--------------+---------------+
#  |  PersonID  |  Name        |  DateOfBirth  |
#  +============+--------------+---------------+
#  |  1         |  Banana Man  |  NULL         |
#  +------------+--------------+---------------+

Option 1: Emulating NULL through a flag and a default value:

Instead of making the column nullable, any default value is specified (e.g. 1900-01-01). An additional BOOLEAN column will specify whether the value in DateOfBirth should simply be ignored or whether it actually contains data.

#  +------------------------------------------------------------------+
#  |                              People'                             |
#  +------------+--------------+----------------------+---------------+
#  |  PersonID  |  Name        |  IsDateOfBirthKnown  |  DateOfBirth  |
#  +============+--------------+----------------------+---------------+
#  |  1         |  Banana Man  |  FALSE               |  1900-01-01   |
#  +------------+--------------+----------------------+---------------+

Option 2: Turning a nullable column into a separate table:

The nullable column is replaced by a new table (DatesOfBirth). If a record doesn't have data for that column, there won't be a record in the new table:

#  +---------------------------+ 1    0..1 +----------------------------+
#  |         People'           | <-------> |         DatesOfBirth       |
#  +------------+--------------+           +------------+---------------+
#  |  PersonID  |  Name        |           |  PersonID  |  DateOfBirth  |
#  +============+--------------+           +============+---------------+
#  |  1         |  Banana Man  |
#  +------------+--------------+

While this seems like the better solution, this would possibly result in many tables that need to be joined for a single query. Since OUTER JOINs won't be allowed (because they would introduce NULL into the result set), all the necessary data could possibly no longer be fetched with just a single query as before.


Question: Are there any other options for eliminating NULL (and if so, what are they)?

+1  A: 

I haven't read it, but there's an article called How To Handle Missing Information Using S-by-C on the the Third Manifesto website that's run by Hugh Darwen and C.J. Date. This isn't written by C.J. Date, but I'd assume that since it's one of the articles on that website it's probably similar to his opinions.

ho1
There's some interesting information on that website. However, the specific paper that you mention seems more academic in nature: It gives examples in the Tutorial D language and not in SQL, which is what's actually used today. (+1 for the link to the website.)
stakx
@stakx: Ah, doesn't surprise me I suppose, I found the website a while ago after reading another question on SO but I found it all a bit too academical to actually read through anything.
ho1
+1  A: 

One alternative may be the entity-attribute-value model:

 entity  attribute    value
 1       name         Banana Man
 1       birthdate    1968-06-20

If the birthdate was unknown, you'd just omit its row.

Andomar
I guess the problem with this model is that you have to choose a single type for column `value` that is appropriate for all sorts of things (ie. this would require parsing strings or even `BLOB`s on the client side).
stakx
@stakx: You could add a `type` column along with `str_value`, `int_value`, `real_value` columns. And I think SQLIte supports EAV natively and store integers, floats and strings in the same column efficiently
Andomar
@Andomar: I didn't know that SQLite has native support for the EAV model. It's an interesting feature, despite being a non-standard (?) one that other RDBMS might not have. Thanks for the info!
stakx
EAV is at least controversial - check out http://www.simple-talk.com/sql/t-sql-programming/avoiding-the-eav-of-destruction/ or http://www.simple-talk.com/sql/database-administration/five-simple--database-design-errors-you-should-avoid/ for some good reason why you should at least be careful when choosing the EAV approach
marc_s
+2  A: 

I saw Date's colleague Hugh Darwen discuss this issue in an excellent presentation "How To Handle Missing Information Without Using NULL", which is available on the Third Manifesto website.

His solution is a variant on your second approach. It's sixth normal form, with tables to hold both Date of Birth and identifiers where it is unknown:

#  +-----------------------------+ 1    0..1 +----------------------------+
#  |         People'             | <-------> |         DatesOfBirth       |
#  +------------+----------------+           +------------+---------------+
#  |  PersonID  |  Name          |           |  PersonID  |  DateOfBirth  |
#  +============+----------------+           +============+---------------+
#  |  1         |  Banana Man    |           ! 2          | 20-MAY-1991   |
#  |  2         |  Satsuma Girl  |           +------------+---------------+
#  +------------+----------------+
#                                  1    0..1 +------------+
#                                  <-------> | DobUnknown |
#                                            +------------+
#                                            |  PersonID  |
#                                            +============+
#                                            | 1          |
#                                            +------------+

Selecting from People then requires joining all three tables, including boilerplate to indicate the unknown Dates Of Birth.

Of course, this is somewhat theoretical. The state of SQL these days is still not sufficiently advanced to handle all this. Hugh's presentation covers these shortcomings. One thing he mentions is not entirely correct: some flavours of SQL do support multiple assignment - for instance Oracle's INSERT ALL syntax.

APC
I've read the paper on Third Manifesto and quite like the solution, mainly because it resolves the ambiguity of what exactly `NULL` means. While this separation of different meanings of `NULL` into separate tables improves data quality, I agree with you that actually querying the data will become more difficult.
stakx
+1  A: 

Option 3: Onus on the record writer:

CREATE TABLE Person
(
  PersonId int PRIMARY KEY IDENTITY(1,1),
  Name nvarchar(100) NOT NULL,
  DateOfBirth datetime NOT NULL
)

Why contort a model to allow null representation when your goal is to eliminate them?

David B
*Every* solution to my question would eventually have `NOT NULL` on all columns (where applicable). But the real question -- if you don't reduce my post to just the last sentence -- was *how* a relational data model would have to be modified **so that it can still accomodate missing or unknown information** for some attributes. Your solution however completely *disallows* missing information for the `DateOfBirth` column and thus misses the most important bit.
stakx
If you still permit the third value, you must still write logic about the third value. I don't know why your implementation of the third value would be better than any db vendor's implementation.
David B
@David B: I want to get *rid* of the third value (`UNKNOWN`), *but* at the same time still be able to represent the fact that some information is missing (or unknown, or whatever). This is obviously possible. I am investigating different ways of doing this. You're perhaps saying, "Why not just use `NULL`, it's there for just that purpose?" -- Because the 3VL *can* lead to confusing and unintuitive queries and query results. Even without 3VL, one of course still has to deal *somehow* with possibly missing information -- but doing this with 2VL might prevent some logic errors.
stakx
It's the concept of null that makes the queries confusing and unintuitive, not the mechanism of null. As long as the concept is allowed, you will have to struggle against it.
David B