I want to represent documents in a database. There are several different types of documents. All documents have certain things in common, but not all documents are the same.

For example, let's say I have a basic table for documents...

TABLE docs (
    ID
    title
    content
)

Now let's say I have a subset of documents that can belong to a user, and that can have additional info associated with them. I could do the following...

TABLE docs (
    ID
    userID -> users(ID)
    title
    content
    additionalInfo
)

...however this will result in a lot of null values in the table, as only some documents can belong to a user, not all. So instead I have created a second table "ownedDocs" to extend "docs":

TABLE ownedDocs (
    docID -> docs(ID)
    userID -> users(ID)
    additionalInfo
)

I am wondering: is this the right way to do it? (I am worried because while everything is in one table, I have a one-to-many relationship between docs and users. However, by creating a new table ownedDocs, the data structure looks like I have a many-to-many relationship between docs and users, which will never occur.)

Thanks in advance for your help

+3  A: 

the data structure looks like I have a many-to-many relationship between docs and users, which will never occur.

Understood, then you need to have the userid in the docs table to ensure a one-to-many relationship (one user, potentially many documents).

I don't see the harm in the additional info columns being null when a document is not associated with a particular user, which is indicated by the userid column being null. Splitting the additional info into another table still means a one-to-one relationship, so you're best off using one table with doc, user and additional info...

OMG Ponies
+1  A: 

It depends on the level of normalization you want to achieve. Typically, based on the description you are providing, I would structure my DB like so:

table docs (id, title, content);
table users (id, ...);
table users_docs (doc_id, user_id);
table doc_info(doc_id, additional_info);

Someone correct me if I am wrong, but this should be 3rd normal form. This keeps the structure nice and clean and uses only the bits required to store the data as expected. You can store all elements independently but related where needed.

Depending on the nature of the additional information, you may need to make some changes. For example, will the additional info ALWAYS correspond to a user? Will it always be supplied if the doc is associated with a user? If so, then you can add it to the users_docs table. But this should at least show you the normalization.

cdburgess
Your schema is exactly the way I have been doing it. But, like I say, I am worried about losing the expression of a one-to-many relationship in the data structure itself. Even though one document can only ever belong to one user, when the data structure is completely normalized, as with the above, then this relationship is no longer clear. I guess I'm wondering: Which method is the most "proper"?
Travis
Hmmm. Maybe make doc_id the primary key of users_docs. Then a document cannot be put in more than once.
cdburgess
+2  A: 

"by creating a new table ownedDocs, the data structure looks like I have a many-to-many relationship between docs and users, which will never occur."

If you make OwnedDocs.DocId the primary key, it will be quite clear that a document can belong to at most one user, so a many-to-many relationship is impossible.
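
To illustrate, here is a minimal sketch of that primary key at work. It assumes SQLite driven through Python's sqlite3 module, since the question does not name a database. The column types and the try_assign helper are hypothetical; the table and column names follow the question.

```python
import sqlite3

# Assumption: SQLite in memory. Table and column names follow the question;
# the column types are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE users (ID INTEGER PRIMARY KEY);
    CREATE TABLE docs (
        ID      INTEGER PRIMARY KEY,
        title   TEXT,
        content TEXT
    );
    -- docID is both the primary key and a foreign key to docs,
    -- so each document can appear in ownedDocs at most once.
    CREATE TABLE ownedDocs (
        docID          INTEGER PRIMARY KEY REFERENCES docs(ID),
        userID         INTEGER NOT NULL REFERENCES users(ID),
        additionalInfo TEXT
    );
    INSERT INTO users (ID) VALUES (1), (2);
    INSERT INTO docs (ID, title, content) VALUES (10, 'a', 'x'), (11, 'b', 'y');
""")


def try_assign(doc_id, user_id):
    """Hypothetical helper: return True if the ownership row was accepted."""
    try:
        conn.execute(
            "INSERT INTO ownedDocs (docID, userID) VALUES (?, ?)",
            (doc_id, user_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


first = try_assign(10, 1)   # user 1 owns doc 10
second = try_assign(11, 1)  # the same user may own many docs
third = try_assign(10, 2)   # rejected: doc 10 already has an owner
print(first, second, third)
```

One user can still own many documents, but a second owner for the same document violates the primary key.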

The modelling of zero-or-one-to-one relationships is tricky. If we have just the one sub-type, then the single table with NULL columns is a reasonable approach. However, it is good practice to ensure that the sub-type's attributes are only populated when appropriate. In the given example that would mean a check constraint to enforce this rule:

check (userID is not null or AdditionalInfo is null)

Or maybe even this rule:

check ( (userID is not null and AdditionalInfo is not null)
        or (userID is null and AdditionalInfo is null) )
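
As a rough sketch of the first (looser) rule in action, again assuming SQLite through Python's sqlite3 module: the users table and its foreign key are left out to keep it short, the column types are illustrative, and insert_ok is a hypothetical helper.

```python
import sqlite3

# Assumption: SQLite in memory; single-table design from the question,
# with the first check constraint attached to the table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE docs (
        ID             INTEGER PRIMARY KEY,
        userID         INTEGER,
        title          TEXT,
        content        TEXT,
        additionalInfo TEXT,
        CHECK (userID IS NOT NULL OR additionalInfo IS NULL)
    )
""")


def insert_ok(user_id, info):
    """Hypothetical helper: return True if the row passes the constraint."""
    try:
        conn.execute(
            "INSERT INTO docs (userID, title, content, additionalInfo) "
            "VALUES (?, 't', 'c', ?)",
            (user_id, info),
        )
        return True
    except sqlite3.IntegrityError:
        return False


unowned_plain = insert_ok(None, None)         # accepted: plain document
owned_with_info = insert_ok(1, "notes")       # accepted: owned, with info
owned_no_info = insert_ok(1, None)            # accepted under the looser rule
unowned_with_info = insert_ok(None, "notes")  # rejected: info without an owner
print(unowned_plain, owned_with_info, owned_no_info, unowned_with_info)
```

Under the stricter second rule, the owned-but-no-info row would be rejected as well.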

The relationship between attributes won't show up in an ERD (unless you use a naming convention). For sure, the mandatory nature of AdditionalInfo for owned documents won't be obvious in the second case.

Once we have several such sub-types, the case for separate tables becomes compelling, especially if the sub-types constitute an arc, e.g. a Document can be a FinancialDocument or a MedicalDocument or a PersonnelDocument, but not more than one of these. I once implemented such a model using a single table with lots of null columns, views and check constraints. It was horrible. Sub-type tables are definitely the way to go.

APC
Thank you. Specifying the docID as a primary key in the sub table is an elegant solution.
Travis