We have two tables: OriginalDocument and ProcessedDocument. In the first one we put the original, unprocessed document. After it is validated and processed (converted to our XML format and parsed), it is put into the ProcessedDocument table. A processed document can be valid or invalid. Which makes more sense: having two different tables for valid and invalid documents, or just one table with a 'Valid' column? Some of the columns (~5-7) are irrelevant for an invalid document. Storing both invalid and valid documents in one table would also fill it with NULL columns (if a document is invalid, information like document number or receiver can be unknown). What else should we consider and weigh when making this decision?

+4  A: 

To me it sounds like it would make sense to have a bit column, since all documents have actually been processed; it is just that some have been determined to be invalid. And if only 5 or so out of, say, 10-15 columns don't apply, there is no need to manage two structures for the same data.

Now, another thing you could look at is: do you need to regularly get information on both valid and invalid documents at the same time? If so, then you really do want them in one table.

If you don't ever need to query them together, or if once a document is "invalid" you don't need it again except for history, then it could make sense to move it to its own table.
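
A minimal sketch of the single-table approach with a validity flag (SQL Server-style syntax; the column names beyond those mentioned in the question, and the filtered index, are illustrative assumptions only):

    CREATE TABLE ProcessedDocument (
        DocumentId      INT IDENTITY PRIMARY KEY,
        IsValid         BIT NOT NULL,
        ReceivedDate    DATETIME NOT NULL,
        DocumentNumber  VARCHAR(50)  NULL,  -- may be unknown for invalid documents
        Receiver        VARCHAR(100) NULL,  -- may be unknown for invalid documents
        ValidationError VARCHAR(400) NULL   -- only populated for invalid documents
    );

    -- if most queries only touch valid documents, a filtered index (SQL Server 2008+)
    -- keeps those lookups from paying for the invalid rows
    CREATE INDEX IX_ProcessedDocument_ValidOnly
        ON ProcessedDocument (DocumentNumber)
        WHERE IsValid = 1;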

Mitchel Sellers
+5  A: 

Whether the document is valid or invalid, it is still a document, so it makes initial sense for them all to be in the same table.

However, if an invalid document is treated differently by your application to the point where it is almost forgotten (not queried, updated etc.) then split the tables. Having the two types of document together in the same table will do nothing but slow down your queries for no immediate benefit.

I have a document table where valid and invalid documents are kept together but only because the app re-presents the bad document to the user and asks them to fix it.

CResults
Or just partition/subpartition/index the Document table on the DOCUMENT_VALIDITY field, if that's an option in the DBMS of choice. The validity of the document sounds like an attribute of the document, so map it as such.
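For instance, with Oracle-style list partitioning (syntax differs by DBMS, and the names here are illustrative only):

    CREATE TABLE processed_document (
        document_id       NUMBER PRIMARY KEY,
        document_validity CHAR(1) NOT NULL,   -- 'V' = valid, 'I' = invalid
        document_number   VARCHAR2(50),
        receiver          VARCHAR2(100)
    )
    PARTITION BY LIST (document_validity) (
        PARTITION p_valid   VALUES ('V'),
        PARTITION p_invalid VALUES ('I')
    );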
Adam Musch
amen Adam Musch
Stephanie Page
+1  A: 

What shape are your queries? Do you frequently wish to deal with a group of (all?) documents, irrespective of whether they're valid? Or does every query only ever concern valid (or invalid) documents?

Or do you wish to deal with groups (irrespective of validity), but frequently perform additional work with valid documents? That may point to a base table and an additional table containing the valid-document columns.

Damien_The_Unbeliever
+1  A: 

Think of OriginalDocument as an intermediate table. It can change as your input formats change, and it will contain fields which are not relevant for imported ("processed") documents, like import date or import error description. And you can clean this table periodically.

In contrast to OriginalDocument, the ProcessedDocument table will contain only documents and fields valid for your system, with all of the check constraints, indexes and associated business logic. Its structure will change as your system's internal logic changes.
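
A rough sketch of that two-stage layout (generic SQL; all column names and types here are placeholders, not the poster's actual schema):

    -- staging: raw input, loose constraints, purged periodically
    CREATE TABLE OriginalDocument (
        Id          INT PRIMARY KEY,
        RawContent  TEXT NOT NULL,
        ImportDate  TIMESTAMP NOT NULL,
        ImportError VARCHAR(400) NULL       -- why the import failed, if it did
    );

    -- target: only documents the system accepts, with the real constraints
    CREATE TABLE ProcessedDocument (
        Id                 INT PRIMARY KEY,
        OriginalDocumentId INT NULL,        -- no FK, so the staging table can be purged freely
        DocumentNumber     VARCHAR(50) NOT NULL UNIQUE,
        Receiver           VARCHAR(100) NOT NULL,
        XmlContent         TEXT NOT NULL,
        CHECK (DocumentNumber <> '')
    );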

+1  A: 

Another thing you might want to take into consideration is the lifecycle and use cases of the rows. If the invalid documents are purged regularly, it might help to have them in separate tables. If the attributes of invalid documents stay limited, but valid documents are getting new columns, that would be a factor in favor of separate tables, too. As the entities are more and more different in behavior and usage, there are more indications that separate tables are merited.

Cade Roux
They are not different entities. They are all the same entity. A delete is a delete from any table. If truncates are possible, use partitioning.
Stephanie Page
@Stephanie: You can not know if the two types of documents should be realized as different tables or not on the physical level. Not unless you know the full semantics of the data and all the processes that might ever be considered to be a part of the system. On conceptual level they are not the same - as long as you have a single column that is always NULL for certain records you have denormalized entities and further normalization is perfectly possible.
Unreason
+2  A: 

Wow, so much bad advice and design myths in a single question it's hard to know where to start.

Is this a VLDB? Are you talking hundreds of TB, hundreds of GB, or 1-10 GB?

Is this an ultra-high-performance DB? Do you need to squeeze out microseconds?

Most advice tends to lean toward those extremes where you might break a few basic rules for the sake of performance.

An earlier poster said,

"Whether the document is valid or invalid, it is still a document so it makes inital sense for them all to be in the same table."

He was on the right track there. And for that matter, whether it's processed or unprocessed it's also a document. I strongly question the first table split.

He then says,

"Having the two types of document together in the same table will do nothing but slow down your queries for no immediate benefit."

I have no idea what that advice is based upon. If your RDBMS supports indexes, more data has a very marginal additional cost at certain sizes of your index, because your b-tree gets one level deeper. If you take his statement at face value, you should limit your table to n rows each and keep making new ones because "more data in your table = slower queries." I have no idea why people persist in this notion. If you have queries that REQUIRE full table scans for one type or the other, let's talk partitioning, not a new table. It takes about an extra 10 milliseconds to find a row in a billion-row table compared to a million-row table, because the index will probably only be one blevel deeper between the two.

Another poster said,

"5-7 columns that do not apply to invalid documents NOT NULL so valid documents are required to have them. In my opinion, with that many columns empty for invalid documents, it justifies a different table."

I wish people would explain their reasons. HOW does it justify it? On what basis would you make that decision? Is 4 too many? Why not? But 5 is too many? Maybe he assumes you're using an ancient RDBMS with fixed field lengths; I can't tell. If you put the nullable columns at the end of the row, you'll pay no cost for them. In the middle, a few extra bytes. If that's a HUGE deal, if you're really scrambling to make this multi-TB table a wee bit smaller... we'll talk about vertical partitioning... not a whole new table. Since you'll be extending the length of n% of rows, you'll want to carefully choose your PCTFREE, or however your database does that. Other than that, there's little downside to the nullable columns.

So let's talk about all the downsides of three tables.

I'm going to assume your table looks like this:

  • A surrogate PK column with a unique index.
  • A candidate key column with a unique index.
  • A few foreign keys to 'lookup' tables.
  • Several data fields.
  • The 5-7 nullable columns that only apply to documents in one state.
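
In DDL, something along these lines (every name here is invented to match the list above, and the lookup tables are placeholders):

    CREATE TABLE Document (
        DocumentId      INT PRIMARY KEY,                -- surrogate key
        DocumentNumber  VARCHAR(50) NOT NULL UNIQUE,    -- candidate key
        StatusId        INT NOT NULL REFERENCES DocumentStatus (StatusId),
        SenderId        INT NOT NULL REFERENCES Party (PartyId),
        ReceivedDate    DATE NOT NULL,
        XmlContent      TEXT NOT NULL,
        -- the handful of nullable, state-specific columns
        ValidationError VARCHAR(400) NULL,
        ValidatedDate   DATE NULL,
        ValidatedBy     VARCHAR(50) NULL
    );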

The first issue is that you'll have three PKs across the three tables to make sure that the key is unique... but there's no cross-table object to guarantee uniqueness across all three combined. Unless you're painstaking in your approach to the code that moves data from one table to the next, you could have the same document twice or more, once in each table. If you have a single table for original, processed, and invalid documents, there's no way that can ever happen.

With three tables, all of your constraints are going to be validated over and over. When you do your insert into the Original table, the PK is validated, the AK is validated, the FKs are validated, the other columns are validated. Room is made in all of the indexes for these new entries, perhaps causing block splits. Now you process the file and delete the entry from the Original table; all of those indexes suffer deletes, leaving empty space behind. Your insert into the next table suffers all of the cost of your first insert again: your indexes are acted upon, maybe causing block splits, and your PK, AK and FKs are all validated again. Lather, rinse, repeat for the invalid table.

Now, what happens to your data model if you adopt this paradigm when you discover that the business needs a 4th state? You're going to add a 4th document table for those in the unsubmitted state, or sent state. After all, the new sent state has 5-7 columns unneeded by the other states.

And there are lots of queries which become horrible to write and run with multiple tables; with a single table they are clear, concise and fast... the size of a table really only affects full table scans, which we try to avoid for tables like these.

I've seen systems like these. One major operational query is, "Where is my document?"

You've got to search three tables to find its state. What most people do next is build a UNION ALL view of all three tables to facilitate the myriad of questions like that. If the other poster thinks your queries slow down with other data in your table, see how they really slow down when you do a UNION ALL to accomplish the same thing: one index of blevel 3 as opposed to three indexes of blevel 2.
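
A sketch of what that usually ends up looking like (hypothetical table and column names), compared with the one-table lookup:

    -- the UNION ALL view people bolt on top of the three tables
    CREATE VIEW AllDocuments AS
    SELECT DocumentNumber, 'Original'  AS State FROM OriginalDocument
    UNION ALL
    SELECT DocumentNumber, 'Processed' AS State FROM ProcessedDocument
    UNION ALL
    SELECT DocumentNumber, 'Invalid'   AS State FROM InvalidDocument;

    -- "where is my document?" now probes three indexes...
    SELECT State FROM AllDocuments WHERE DocumentNumber = '12345';

    -- ...instead of one probe against a single Document table
    SELECT Status FROM Document WHERE DocumentNumber = '12345';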

EXAMPLE/EDIT

I work in a trading company. We execute trades with counterparties. For accounting and legal reasons our company is defined as several companies; we'll call them Trading, Holding, JointVenture. Our counterparties we'll call JonesCo, SmithBarely, GoldSax.

So let's say the internal companies have a unique set of columns and the counterparties have a unique set of columns. You'd say that proper normalization would force them into two tables. So let's do that.

INT_CO_T:
  1  Trading
  2  Holding
  3  JointVenture

CNTR_PTY_T:
  1  JonesCo
  2  SmithBarely
  3  GoldSax

Now I need a trade table where I map the transaction between our company (or companies) and the counterparties:

TRADE_T (Int_co_T.ID, Ctr_pty_T.ID, other trade columns)

Great.

Whoops, the business says that JointVenture will execute trades with Trading. BTW, this is a very common scenario; it happens all the time. A trading house would call these book-to-book trades.

Now I'm left with two choices (three, really):

The first is that I could do something very silly and place JointVenture and Trading into the counterparty table so that my mapping table will still work. This leads to nightmare queries which I'm sure those involved in this conversation will recognize. Or, second, I can build a separate mapping table... and that too leads to some unions if I want to see all of the trades for a given company.

The third and better way is to build a single table for both counterparties and internal companies, called Trading_entities or something. Now I need one mapping table to show either internal or external trades, and I can easily see net position and net exposure with one query against two tables, etc.

If you're really hung up on the nullable fields, then vertically partition that table and use three tables. But the main table will have the full list of, and most importantly a single key for, either subtype of trading participant.
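
A sketch of that single-table arrangement (the names and columns here are hypothetical):

    -- every trading participant, internal or external, in one table with one key
    CREATE TABLE trading_entity (
        entity_id   INT PRIMARY KEY,
        entity_name VARCHAR(100) NOT NULL,
        is_internal CHAR(1) NOT NULL        -- 'Y' for Trading, Holding, JointVenture
    );

    -- one mapping table covers external and book-to-book trades alike
    CREATE TABLE trade (
        trade_id         INT PRIMARY KEY,
        buyer_entity_id  INT NOT NULL REFERENCES trading_entity (entity_id),
        seller_entity_id INT NOT NULL REFERENCES trading_entity (entity_id),
        trade_date       DATE NOT NULL
    );

    -- activity per participant is one query, no unions; net position/exposure
    -- would be a similar single query over the same two tables
    SELECT e.entity_name, COUNT(*) AS trade_count
    FROM trade t
    JOIN trading_entity e ON e.entity_id = t.buyer_entity_id
    GROUP BY e.entity_name;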

Stephanie Page
+1 Nice answer. Made me think again about my design choices
CResults
Great answer! We actually need to store both unprocessed and processed documents, so there would be no deleting going on. From the question: "(converted to our xml format and parsed)"
Aidas
Stephanie, your answer is OK, but not great. Denormalizing brings performance in certain cases (from a particular perspective). What your answer lacks is a mention of when that same denormalization will bite your ass. For example, where is the limit for you - if you have two entities with different properties, do you stick them into the same table regardless of how different they are, as long as they match on a few properties? Should we store everything in one big table?
Unreason
Goran: I disagree with your characterization of storing all the facts about a document in a document table as denormalization. It is perfectly normal (pun intended) to store optional state information about the same entity in the same table. What normal form do you think it violates? Would you store a person's contact info in two separate tables, based on whether they had a middle name or not?
Phil Sandler
@Sandler: I was talking about a case when a column is NULL for ALL records of a certain type. That can certainly lead to anomalies/inconsistent data and there is redundancy (therefore it is denormalized). As for the NF, according to C.J.Date it breaks 1NF (http://en.wikipedia.org/wiki/First_normal_form). But this IS controversial. Pragmatically, again, I ask would you put all the entities in your database that have one-to-(zero-or-)one relationship into one table? Where is the line?
Unreason
Goran: in the middle name example, the column would be null for all records of a certain type--people with no middle names, so you would need a separate table to store those types, and many other types like contacts who have no phone number. Additionally, if you stand by Date's definition, you can NEVER use nulls. Is this your position? Your objection to the way the table is normalized is that it uses nulls?
Phil Sandler
Goran (2): I generally model one table for each logical entity in the system (some with nullable columns). There are logical and pragmatic reasons why I might store one or more subtypes in a separate related table, but I don't consider this to be a decision of normalization vs. denormalization.
Phil Sandler
@Sandler: Normalization is, by definition, a process of "ensuring that a database structure is free of certain undesirable characteristics—insertion, update, and deletion anomalies that could lead to a loss of data integrity." Having a document table in which I can insert inconsistent data is not normal in this sense (pun returned :). At physical level you could prevent it with checks; at conceptual level a natural way to express it is to subtype the entity. Also, please note that I have not said you should never do this, but that Stephanie fails to warn when such denormalization would hurt.
Unreason
Totally agree that in the CONCEPTUAL model you'd show these as subtypes. CONCEPTUAL models do not support NULLs at all. But when you make it physical, it is not only acceptable but preferred. I'm going to edit my answer above to give a prime example.
Stephanie Page
@Goran, no... I'd not store everything in one table. See my other rails against the EAVs that many on the board recommend. But the question of when you stop rolling up objects is a good one, and one that can't have clear rules around it. Would you put plain teeshirts and graphic teeshirts in different tables? The printed shirt would need a description of what the graphic is. Would you put pants and shorts in the same table? Pants come with an inseam measurement, shorts do not. Would you put all four in the same table called Apparel? I don't think there is one clear answer based solely on nulls.
Stephanie Page
cont... so when you go to construct a purchase for the GAP store website, you'd need to have an FK column to the graphic teeshirt table, another FK column to the plain teeshirt table, and another to shorts. Or you could have one collection of apparel items. I'd vote for the latter. Would I put my customers in there too? No, clearly not. Can I explain the algorithm I used? Not in a million years.
Stephanie Page
+1  A: 

Do try to make a distinction between logical and physical modelling.

Even if the difference between the two entities is only seven properties, they are logically different things in those seven items. At the same time they are the same thing in the other properties.

The way to represent that logically is to have a one-to-zero-or-one relationship between two tables: one table stores all the common properties (the superclass), and the other (the subclass) stores the subtype-specific properties, repeating only the ID from the superclass (see the sketch after the list below).

In terms of performance this is not so bad:

  • When you don't care what type of document you work with, you will query the superclass table (gain).
  • When you know you want only the specific properties found in the subclass table, you will work only with that table (this might be a real gain).
  • You will pay a price only when you need to join the two tables (joins have a cost compared to denormalized structures such as storing everything in a single table).
  • You will also pay a price when inserting subclass records, because you will be inserting into two tables (this cost might be very low and/or justified).
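
A minimal sketch of that superclass/subclass split (the names are illustrative, not from the question):

    -- superclass: every document, valid or not, with the common properties
    CREATE TABLE document (
        document_id   INT PRIMARY KEY,
        received_date DATE NOT NULL,
        xml_content   TEXT NOT NULL
    );

    -- subclass: only valid documents, one-to-zero-or-one with document
    CREATE TABLE valid_document (
        document_id     INT PRIMARY KEY REFERENCES document (document_id),
        document_number VARCHAR(50) NOT NULL,
        receiver        VARCHAR(100) NOT NULL
    );

    -- all documents, regardless of validity: superclass only, no join
    SELECT document_id, received_date FROM document;

    -- valid documents with their extra properties: this is the join you pay for
    SELECT d.document_id, v.document_number, v.receiver
    FROM document d
    JOIN valid_document v ON v.document_id = d.document_id;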

Depending on the processes you are modelling, the frequency of these queries, and other things (such as security for both entities, ownership, differences in integrity rules), you might decide to store this information in one table in the database or in two (either can be much faster in borderline cases, and the two-table solution can also be denormalized a bit; for example, you could still store the type of the document in the main table to avoid the join, if that kind of query is all you care about).

Or maybe your implementation decisions will be driven by your choice of application framework, and for that reason you might really prefer working with a single table, or the other way around (for example, automatic creation of data-entry forms in frameworks such as django-admin).

Whatever you do, realize the difference between the logical and the physical design. In your logical design, normalize everything - it will pay off. In the physical implementation, try different scenarios and test, test, test with your own data. Never confuse the order of the two (logical-conceptual and physical-practical modelling).

Unreason