128 views · 6 answers

In a DB I'm designing, there's one fairly central table representing something that's been sold or is for sale. It distinguishes between personal sales (like eBay) and sales from a proper company. This means there are only one or two fields which are not equally appropriate to both cases... for instance, one field is used only in one case, and another field is optional in one case but mandatory in the other. If there were more specialization, it would make sense to have a core table and then two tables with the fields relevant to the specific cases. But here, creating two tables just to hold one field each plus the reference to the core table seems both aesthetically bad and painful for the query designer and the DB software.

What do you think? Is it OK to bend the rules slightly by having a single table with weakened constraints - meaning the DB cannot 100% prevent inconsistent data from being added (in a very limited way) - or do I suck it up and create dumb-looking one-field tables?

+1  A: 

I think having these fields in one table is not going to hurt you today, and it's the choice I would go for. Just remember that as your database evolves you may need to refactor into two separate tables (if you need more fields).

John Nolan
A: 

If there are two distinct entities, "Personal Sales" and "Company Sales", then perhaps you ought to have two tables to represent those entities?

Scott Ferguson
That is the dilemma. Both share the same set of fields, so having two unrelated tables would be bad... especially since some queries will not distinguish between personal/company.
John
+2  A: 

What you're describing with one table for common columns and dependent tables for subtype-specific columns is called Class Table Inheritance. It's a perfectly good thing to do.

What @Scott Ferguson seems to be describing (two distinct tables for the two types of sales) is called Concrete Table Inheritance. It can also be a good solution depending on your needs, but more often it just makes it harder to write queries that span both subtypes.
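For illustration, a Class Table Inheritance layout for this case might look like the following sketch (table and column names here are made up for the example), using Python's sqlite3 to stand in for a real database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Core table holds the columns common to every sale
CREATE TABLE sales (
    sale_id  INTEGER PRIMARY KEY,
    title    TEXT NOT NULL,
    price    NUMERIC NOT NULL
);

-- One dependent table per subtype, keyed by the core table's PK
CREATE TABLE personal_sales (
    sale_id   INTEGER PRIMARY KEY REFERENCES sales(sale_id),
    paypal_id TEXT NOT NULL          -- mandatory for personal sales
);

CREATE TABLE company_sales (
    sale_id INTEGER PRIMARY KEY REFERENCES sales(sale_id),
    sku     TEXT                     -- only meaningful for companies
);
""")

# A query across both subtypes only needs the core table
conn.execute("INSERT INTO sales VALUES (1, 'Lamp', 9.99)")
conn.execute("INSERT INTO personal_sales VALUES (1, 'pp-123')")
count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(count)  # 1
```

The cost, as the question notes, is an extra join whenever you need a subtype-specific column.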

If all you need is one or two columns that apply only to a given subtype, I agree it seems like overkill to create dependent tables. Remember that most brands of SQL database support CHECK constraints or triggers, so you can design data integrity rules into the metadata.

CREATE TABLE Sales (
 sale_id     SERIAL PRIMARY KEY,
 is_business INT NOT NULL,  -- 1 for corporate, 0 for personal
 sku         VARCHAR(20),   -- only for corporate
 paypal_id   VARCHAR(20),   -- mandatory, but only for personal
 CONSTRAINT personal_needs_paypal
   CHECK (is_business = 1 OR paypal_id IS NOT NULL),
 CONSTRAINT sku_is_corporate_only
   CHECK (is_business = 1 OR sku IS NULL)
);
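A quick way to see such a CHECK constraint in action, with the check written as `is_business = 1 OR paypal_id IS NOT NULL` so it only constrains personal rows (a sketch using Python's sqlite3; the schema is the hypothetical one from this answer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE sales (
    sale_id     INTEGER PRIMARY KEY,
    is_business INTEGER NOT NULL,
    sku         TEXT,
    paypal_id   TEXT,
    -- personal sales (is_business = 0) must carry a paypal_id
    CHECK (is_business = 1 OR paypal_id IS NOT NULL)
)
""")

conn.execute("INSERT INTO sales VALUES (1, 1, 'SKU-1', NULL)")  # corporate: fine
conn.execute("INSERT INTO sales VALUES (2, 0, NULL, 'pp-42')")  # personal with PayPal id: fine

rejected = False
try:
    conn.execute("INSERT INTO sales VALUES (3, 0, NULL, NULL)")  # personal without: rejected
except sqlite3.IntegrityError:
    rejected = True
print("third insert rejected:", rejected)  # True
```

So the single-table design can still have its integrity rule enforced by the database itself, not just by application code.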
Bill Karwin
A: 

News flash: the DB cannot prevent 100% of corrupt data no matter which way you cut it. So far you have only considered what I call level 1 corruption (level 0 corruption is essentially what would happen if you wrote garbage over your database with a hex editor).

I have yet to see a database that could prevent level 2 corruption (records that are syntactically correct but that, taken as a whole, mean something perverse).

Joshua
+1  A: 

There are some who insist that inapplicable fields should never be allowed, but I think this is one of those rules that someone wrote in a book and now we're all supposed to follow it without questioning why. In the case you're describing, a single table sounds like the simple, intelligent solution.

I would certainly not create two tables. Then all the common fields would be duplicated, and all your queries would have to join or union two tables. So the real question is: one table or three? But you seem to realize that.

You didn't clarify what the additional fields are. If the presence or absence of one field implies the record type, then I sometimes use that fact as the record-type indicator rather than creating a redundant type field. For example, if the only difference between a "personal sale" and a "business sale" is that a business sale has a foreign key to a company filled in, then you could simply state that you define a business sale as one with a company filled in, and no ambiguity is possible. But if the situation gets even slightly more complicated, this can be a trap: I've seen applications that say if a is null and b = c/7 then it's record type A, else if b is null and... etc., etc. If you can't do it with one test on one field, forget it and put in a record-type field.
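The "presence of the field is the type" idea can be expressed directly in a query; a minimal sketch (the `company_id` column is hypothetical), again using Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (
    sale_id    INTEGER PRIMARY KEY,
    company_id INTEGER     -- NULL means a personal sale
);
INSERT INTO sales VALUES (1, NULL), (2, 77);
""")

# The single NULL test *is* the record-type indicator
rows = conn.execute("""
    SELECT sale_id,
           CASE WHEN company_id IS NULL THEN 'personal' ELSE 'business' END
    FROM sales
    ORDER BY sale_id
""").fetchall()
print(rows)  # [(1, 'personal'), (2, 'business')]
```

As soon as the rule needs more than this one test, a dedicated type column is the safer choice.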

You can always enforce consistency with code or constraints.

I worry a lot more about redundant data creating consistency problems than about inapplicable fields. Redundant data creates all sorts of problems. Data inapplicable to a record type? In the worst case, just ignore it. If it's a "personal sale" and somehow a company got filled in, ignore it or null it out on sight. Problem solved.

Jay
A: 

The PRO of keeping all fields in one table is that you get rid of JOINs, which makes your queries faster.

The CONTRA is that your rows grow wider, which makes your queries slower.

Which one impacts you more, totally depends on your data distribution and which queries you issue most often.

In general, splitting is better for OLTP systems, joining is better for data analysis (that tends to scan the tables).

Let's imagine 2 scenarios:

  1. Split fields. There are 1,000,000 rows, the average row size is 20 bytes, and the split field is filled once per 50 rows (i.e. 20,000 records in the split table).

    We want to query like this:

    SELECT  SUM(mainfield + COALESCE(splitfield, 0))
    FROM    maintable
    LEFT JOIN
            splittable
    ON      splitid = mainid
    

    This will require scanning 20,000,000 bytes plus nested loops (or hash lookups) to find the 20,000 matching records.

    Each hash lookup is roughly equivalent to scanning 10 rows, so the total time will be equivalent of scanning 20,000,000 + 10 * 20,000 * 20 = 24,000,000 bytes

  2. Joined fields. There are 1,000,000 rows, the average row size is 24 bytes, so the query will scan 24,000,000 bytes.

As you can see, the times tie.

However, if either parameter changes (field is filled more often or more rarely, the row size is more or less, etc), one or another solution will become better.
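The back-of-envelope arithmetic above can be checked in a few lines (using the answer's assumed numbers):

```python
rows = 1_000_000          # rows in the main table
main_row_bytes = 20       # average main-table row size
split_rows = rows // 50   # split field filled once per 50 rows -> 20,000
lookup_cost_rows = 10     # one hash lookup ~ scanning 10 rows

# Scenario 1: scan the main table, then hash-look-up into the split table
split_cost = rows * main_row_bytes + lookup_cost_rows * split_rows * main_row_bytes

# Scenario 2: one wide table with 24-byte rows, scanned end to end
joined_cost = rows * (main_row_bytes + 4)

print(split_cost, joined_cost)  # 24000000 24000000
```

Changing any one assumption (fill ratio, row width, lookup cost) tips the balance one way or the other, which is the answer's point.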

Quassnoi
I think more important than minor performance differences is maintainability. If you create two tables that have twenty columns in common and three that differ, then you have to write twice as many queries, every schema change must be done twice, etc.
Jay
@Jay: the common fields would be only stored in one table.
Quassnoi