views: 165

answers: 2
As a corollary to this question, I was wondering if there were good comparative studies I could consult and pass along about the advantages of letting the RDBMS do the join optimization versus systematically denormalizing in order to always access a single table at a time.

Specifically, I want information about:

  • Performance of normalization versus denormalization.
  • Scalability of normalized vs. denormalized systems.
  • Maintainability issues of denormalization.
  • Model consistency issues with denormalization.

A bit of history to see where I am going with this: our system uses an in-house database abstraction layer, but it is very old and cannot handle more than one table. As such, all complex objects have to be instantiated using multiple queries, one on each of the related tables. To make sure the system always uses a single table, heavy, systematic denormalization is used throughout the tables, sometimes flattening two or three levels deep. As for n-n relationships, they seem to have been worked around by carefully crafting the data model to avoid such relations and always fall back on 1-n or n-1.
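
To make this concrete, here is a made-up sketch (not our actual schema) of what I mean by flattening: the parent attributes get copied down into each child row so that the child can always be read with a single-table query.

    -- Normalized: each fact lives in one place; joins reassemble the object.
    CREATE TABLE region   (region_id   INT PRIMARY KEY, region_name   VARCHAR(50));
    CREATE TABLE customer (customer_id INT PRIMARY KEY, customer_name VARCHAR(50),
                           region_id   INT REFERENCES region(region_id));
    CREATE TABLE orders   (order_id    INT PRIMARY KEY, order_date    DATE,
                           customer_id INT REFERENCES customer(customer_id));

    -- Flattened two levels deep: every order row repeats the customer and
    -- region attributes so it can be read without any join.
    CREATE TABLE orders_flat (
        order_id      INT PRIMARY KEY,
        order_date    DATE,
        customer_id   INT,
        customer_name VARCHAR(50),   -- duplicated from customer
        region_id     INT,
        region_name   VARCHAR(50)    -- duplicated from region
    );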

The end result is a convoluted, overly complex system where customers often complain about performance. When analyzing such bottlenecks, they never question the basic premises the system is built on and always look for other solutions.

Did I miss something? I think the whole idea is wrong, but I somehow lack the irrefutable evidence to prove (or disprove) it. This is where I am turning to your collective wisdom to point me towards good, well-accepted literature that can convince the other fellows on my team that this approach is wrong (or convince me that I am just too paranoid and dogmatic about consistent data models).

My next step is to build my own test bench and gather results. Since I hate reinventing the wheel, I want to know what already exists on the subject.

---- EDIT Notes: the system was first built with flat files, without a database system... only later was it ported to a database because a client insisted on the system using Oracle. They did not refactor, but simply added support for relational databases to the existing system. Flat-file support was later dropped, but we are still awaiting a refactor to take advantage of the database.

+1  A: 

A thought: you have a clear impedance mismatch. A data access layer that allows access to only one table? Stop right there; this is simply inconsistent with optimal use of a relational database. Relational databases are designed to do complex queries really well. To have no option other than to return a single table, and presumably do any joining in the business layer, just doesn't make sense.
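
To illustrate with made-up tables, the work the database could do in one optimized statement gets replaced by a pile of single-table queries stitched together in application code:

    -- One round trip; the optimizer chooses the join strategy and uses indexes.
    SELECT o.order_id, o.order_date, c.customer_name
    FROM   orders o
    JOIN   customer c ON c.customer_id = o.customer_id
    WHERE  o.order_date >= DATE '2009-01-01';

    -- A single-table-only layer has to issue something like this instead:
    --   SELECT * FROM orders   WHERE order_date  >= DATE '2009-01-01';
    --   SELECT * FROM customer WHERE customer_id = :id;   -- once per order row
    -- and then stitch the rows back together in the business layer.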

For justification of normalisation, and the potential consistency costs, you can refer to all the material from Codd onwards; see the Wikipedia article.

I predict that benchmarking this kind of stuff will be a never-ending activity; special cases will abound. I claim that normalisation is "normal": people get good enough performance from a clean database design. Perhaps an approach might be a survey: "How normalised is your data? Scale 0 to 4."

djna
I agree with you about the mismatch; every fiber of my technical being is yelling at me how wrong this is, but to be convincing I need more than feelings and hunches. My experience with databases is more practical than academic, so I lack the theoretical background to make a case. There is a lot of material stemming from the article you mentioned; it will take me a while to walk through the branches.
Newtopian
+1  A: 

As far as I know, Dimensional Modeling is the only technique of systematic denormalization that has some theory behind it. This is the basis of data warehousing techniques.

DM was pioneered by Ralph Kimball in "A Dimensional Modeling Manifesto" in 1997. Kimball has also written a raft of books. The book that seems to have the best reviews is "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition)" (2002), although I haven't read it yet.
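
Roughly speaking (illustrative names only, not Kimball's exact examples), a dimensional model denormalizes into a star schema: a central fact table of measurements surrounded by wide, descriptive dimension tables.

    -- Dimension tables: wide, descriptive, deliberately denormalized.
    CREATE TABLE dim_date    (date_key    INT PRIMARY KEY, full_date    DATE,
                              month_name  VARCHAR(10),     year_num     INT);
    CREATE TABLE dim_product (product_key INT PRIMARY KEY, product_name VARCHAR(50),
                              category    VARCHAR(30),     brand        VARCHAR(30));

    -- Fact table: one row per sale; foreign keys to the dimensions plus measures.
    CREATE TABLE fact_sales (
        date_key    INT REFERENCES dim_date(date_key),
        product_key INT REFERENCES dim_product(product_key),
        quantity    INT,
        amount      DECIMAL(10,2)
    );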

There's no doubt that denormalization improves the performance of certain types of queries, but it does so at the expense of other queries. For example, if you have a many-to-many relationship between, say, Products and Orders (in a typical ecommerce application), and you need querying the Products in a given Order to be as fast as possible, then you can store the data in a denormalized way to support that, and gain some benefit.

But this makes it more awkward and inefficient to query all Orders for a given Product. If you have an equal need to make both types of queries, you should stick with the normalized design. This strikes a compromise, giving both queries similar performance, though neither will be as fast as they would be in the denormalized design that favored one type of query.
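
Here is a minimal sketch of that trade-off, using made-up tables and one possible (deliberately crude) way to denormalize:

    -- Normalized: a junction table serves both query directions equally well.
    CREATE TABLE order_product (
        order_id   INT,
        product_id INT,
        PRIMARY KEY (order_id, product_id)
    );

    -- Products in a given order, and orders for a given product: symmetric.
    SELECT product_id FROM order_product WHERE order_id   = 42;
    SELECT order_id   FROM order_product WHERE product_id = 7;

    -- One crude denormalization favoring the first query: the order row
    -- carries its own product list, so reading an order touches one table.
    CREATE TABLE orders_denorm (
        order_id     INT PRIMARY KEY,
        product_list VARCHAR(4000)   -- e.g. '7,12,98', duplicated from order_product
    );

    -- But "all orders containing product 7" now needs a substring scan of
    -- every row, cannot use a normal index, and this naive pattern even
    -- matches product 17, 70, and so on.
    SELECT order_id FROM orders_denorm WHERE product_list LIKE '%7%';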

Additionally, when you store data in a denormalized way, you need to do extra work to ensure consistency, i.e. no accidental duplication and no broken referential integrity. You have to consider the cost of adding manual checks for consistency.
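
For example (again with made-up names), if product_name has been copied into each order line for faster single-table reads, a simple rename has to be applied everywhere the value was duplicated, typically in a hand-written transaction:

    -- Suppose product_name is duplicated into order_items for fast reads.
    UPDATE product
    SET    product_name = 'Deluxe Widget'
    WHERE  product_id   = 7;

    UPDATE order_items
    SET    product_name = 'Deluxe Widget'
    WHERE  product_id   = 7;

    COMMIT;
    -- Miss the second UPDATE anywhere in the code base and the copies
    -- quietly drift apart.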

Bill Karwin
It does, indeed, involve a lot of extra work and yields a lot of redundancy that has to be managed. In my opinion it makes the whole system brittle, as it tends to create a lot of coupling in the business layer of the application that should typically be buried in the persistence and materialization layer. Although I doubt their original design and ideas stemmed from dimensional modeling, I will read up on it; this must be the first time I have heard of denormalization applied for more than surgical optimizations.
Newtopian