views:

386

answers:

7

My employer, a small office supply company, is switching suppliers and I am looking through their electronic content to come up with a robust database schema; our previous schema was pretty much just thrown together without any thought at all, and it's pretty much led to an unbearable data model with corrupt, inconsistent information.

The new supplier's data is much better than the old one's, but their data is what I would call hypernormalized. For example, their product category structure has 5 levels: Master Department, Department, Class, Subclass, Product Block. In addition the product block content has the long description, search terms and image names for products (the idea is that a product block contains a product and all variations - e.g. a particular pen might come in black, blue or red ink; all of these items are essentially the same thing, so they apply to a single product block). In the data I've been given, this is expressed as the products table (I say "table" but it's a flat file with the data) having a reference to the product block's unique ID.

I am trying to come up with a robust schema to accommodate the data I'm provided with, since I'll need to load it relatively soon, and the data they've given me doesn't seem to match the type of data they provide for demonstration on their sample website (http://www.iteminfo.com). In any event, I'm not looking to reuse their presentation structure so it's a moot point, but I was browsing the site to get some ideas of how to structure things.

What I'm unsure of is whether or not I should keep the data in this format, or for example consolidate Master/Department/Class/Subclass into a single "Categories" table, using a self-referencing relationship, and link that to a product block (product block should be kept separate as it's not a "category" as such, but a group of related products for a given category). Currently, the product blocks table references the subclass table, so this would change to "category_id" if I consolidate them together.

I am probably going to be creating an e-commerce storefront making use of this data with Ruby on Rails (or that's my plan, at any rate) so I'm trying to avoid getting snagged later on or having a bloated application - maybe I'm giving it too much thought but I'd rather be safe than sorry; our previous data was a real mess and cost the company tens of thousands of dollars in lost sales due to inconsistent and inaccurate data. Also I am going to break from the Rails conventions a little by making sure that my database is robust and enforces constraints (I plan on doing it at the application level, too), so that's something I need to consider as well.

How would you tackle a situation like this? Keep in mind that I have the data to be loaded already in flat files that mimic a table structure (I have documentation saying which columns are which and what references are set up); I'm trying to decide if I should keep them as normalized as they currently are, or if I should look to consolidate; I need to be aware of how each method will affect the way I program the site using Rails since if I do consolidate, there will be essentially 4 "levels" of categories in a single table, but that definitely seems more manageable than separate tables for each level, since apart from Subclass (which directly links to product blocks) they don't do anything except show the next level of category under them. I'm always a loss for the "best" way to handle data like this - I know of the saying "Normalize until it hurts, then denormalize until it works" but I've never really had to implement it until now.

+2  A: 

If I understand correctly, you want to take their separate tables and turn them into a hierarchy that's kept in a single table with a self-referencing FK.

This is generally a more flexible approach (for example, if you want to add a fifth level), BUT SQL and relational data models don't tend to work well with linked lists like this, even with new syntax like MS SQL Servers CTEs. Admittedly, CTEs make it much better though.

It can be difficult and costly to enforce things, like that a product must always be on the fourth level of the hierarchy, etc.

If you do decide to do it this way, then definitely check out Joe Celko's SQL for Smarties, which I believe has a section or two on modeling and working with hierarchies in SQL or better yet get his book that is devoted to the subject (Joe Celko's Trees and Hierarchies in SQL for Smarties).

Tom H.
Yes that's my main worry about it - if I keep the tables separate then I don't have to worry about it and can traverse easier than having a self-referencing table. A product block can only belong to a subclass, for example, and a subclass has to have a parent class, etc.
Wayne M
+4  A: 

I would prefer the "hypernormalized" approach over a denormal data model. The self referencing table you mentioned might reduce the number of tables down and simplify life in some ways, but in general this type of relationship can be tricky to deal with. Hierarchical queries become a pain, as does mapping an object model to this (if you decide to go that route).

A couple of extra joins is not going to hurt and will keep the application more maintainable. Unless performance degrades due to the excessive number of joins, I would opt to leave things like they are. As an added bonus if any of these levels of tables needed additional functionality added, you will not run into issues because you merged them all into the self referencing table.

Jim Petkus
Very good advice, I will certainly consider it. Thanks! It would make the data easier to import as well since it is already in that particular format.
Wayne M
+1  A: 

I would bring it in as close to their model as possible (and if at all possible, I would get files which match their schema - not a flattened version). If you bring the data directly into your model, what happens if data they send starts to break assumptions in the transformation to your internal application's model?

Better to bring their data in, run sanity checks and check that assumptions are not violated. Then if you do have an application-specific model, transform it into that for optimal use by your application.

Cade Roux
Staging source data with perfect fidelity is always a good idea. I think the OP was going to do that anyways though.
entaroadun
+2  A: 

Normalization implies data integrity, that is: each normal form reduces the number of situations where you data is inconsistent.

As a rule, denormalization has a goal of faster querying, but leads to increased space, increased DML time, and, last but not least, increased efforts to make data consistent.

One usually writes code faster (writes faster, not the code faster) and the code is less prone to errors if the data is normalized.

Quassnoi
Another benefit of a normalised model is flexibility: in an app with evolving requirements a flexible data model is a tremendous boon. As the app matures, aspects may be denormalised for performance (if absolutely necessary) with reduced risk.
Mike Woodhouse
+2  A: 

Self referencing tables almost always turn out to be much worse to query and perform worse than normalized tables. Don't do it. It may look to you to be more elegant, but it is not and is a very poor database design technique. Personally the structure you described sounds just fine to me not hypernormalized. A properly normalized database (with foreign key constraints as well as default values, triggers (if needed for complex rules) and data validation constraints) is also far likelier to have consistent and accurate data. I agree about having the database enforce the rules, likely this is part of why the last application had bad data because the rules were not enforced in the proper place and people were able to easily get around them. Not that the application shouldn't check as well (no point even sending an invalid date for instance for the datbase to fail on insert). Since youa redesigning, I would put more time and effort into designing the necessary constraints and choosing the correct data types (do not store dates as string data for instance), than in trying to make the perfectly ordinary normalized structure look more elegant.

HLGEM
A: 

I totally disagree with the criticisms about self-referencing table structures for parent-child hierarchies. The linked list structure makes UI and business layer programming easier and more maintainable in most cases, since linked lists and trees are the natural way to represent this data in languages that the UI and business layers would typically be implemented in.

The criticism about the difficulty of maintaining data integrity constraints on these structures is perfectly valid, though the simple solution is to use a closure table that hosts the harder check constraints. The closure table is easily maintained with triggers.

The tradeoff is a little extra complexity in the DB (closure table and triggers) for a lot less complexity in UI and business layer code.

entaroadun
A: 

Don't denormalize. Trying to acheive a good schema design by denormalizing is like trying to get to San Francisco by driving away from New York. It doesn't tell you which way to go.

In your situation, you want to figure out what a normalized schema would like. You can base that largely on the source schema, but you need to learn what the functional dependencies (FD) in the data are. Neither the source schema nor the flattened files are guaranteed to reveal all the FDs to you.

Once you know what a normalized schema would look like, you now need to figure out how to design a schema that meets your needs. It that schema is somewhat less than fully normalized, so be it. But be prepared for difficulties in programming the transformation between the data in the flattened files and the data in your desgined schema.

You said that previous schemas at your company cost millions due to inconsistency and inaccuracy. The more normalized your schema is, the more protected you are from internal inconsistency. This leaves you free to be more vigilant about inaccuracy. Consistent data that's consistently wrong can be as misleading as inconsistent data.

Walter Mitty