views:

984

answers:

5

This is a scenario I've seen in multiple places over the years; I'm wondering if anyone else has run across a better solution than I have...

My company sells a relatively small number of products, however the products we sell are highly specialized (i.e. in order to select a given product, a significant number of details must be provided about it). The problem is that while the amount of detail required to choose a given product is relatively constant, the kinds of details required vary greatly between products. For instance:

Product X might have identifying characteristics like (hypothetically)

  • 'Color',
  • 'Material'
  • 'Mean Time to Failure'

but Product Y might have characteristics

  • 'Thickness',
  • 'Diameter'
  • 'Power Source'

The problem (one of them, anyway) in creating an order system that utilizes both Product X and Product Y is that an Order Line has to refer, at some point, to what it is "selling". Since Product X and Product Y are defined in two different tables - and denormalization of products using a wide table scheme is not an option (the product definitions are quite deep) - it's difficult to see a clear way to define the Order Line in such a way that order entry, editing and reporting are practical.


Things I've Tried In the Past

  • Create a parent table called 'Product' with columns common to Product X and Product Y, then using 'Product' as the reference for the OrderLine table, and creating a FK relationship with 'Product' as the primary side between the tables for Product X and Product Y. This basically places the 'Product' table as the parent of both OrderLine and all the disparate product tables (e.g. Products X and Y). It works fine for order entry, but causes problems with order reporting or editing since the 'Product' record has to track what kind of product it is in order to determine how to join 'Product' to its more detailed child, Product X or Product Y. Advantages: key relationships are preserved. Disadvantages: reporting, editing at the order line/product level.
  • Create 'Product Type' and 'Product Key' columns at the Order Line level, then use some CASE logic or views to determine the customized product to which the line refers. This is similar to item (1), without the common 'Product' table. I consider it a more "quick and dirty" solution, since it completely does away with foreign keys between order lines and their product definitions. Advantages: quick solution. Disadvantages: same as item (1), plus lost RI.
  • Homogenize the product definitions by creating a common header table and using key/value pairs for the customized attributes (OrderLine [n] <- [1] Product [1] <- [n] ProductAttribute). Advantages: key relationships are preserved; no ambiguity about product definition. Disadvantages: reporting (retrieving a list of products with their attributes, for instance), data typing of attribute values, performance (fetching product attributes, inserting or updating product attributes etc.)

If anyone else has tried a different strategy with more success, I'd sure like to hear about it.

Thank you.

+1  A: 

This might get you started. It will need some refinement

Table Product ( id PK, name, price, units_per_package)
Table Product_Attribs (id FK ref Product, AttribName, AttribValue)

Which would allow you to attach a list of attributes to the products. -- This is essentially your option 3

If you know a max number of attributes, You could go

Table Product (id PK, name, price, units_per_package, attrName_1, attrValue_1 ...)

Which would of course de-normalize the database, but make queries easier.

I prefer the first option because

  1. It supports an arbitrary number of attributes.
  2. Attribute names can be stored in another table, and referential integrity enforced so that those damn Canadians don't stick a "colour" in there and break reporting.
chris
A: 

Does your product line ever change?
If it does, then creating a table per product will cost you dearly, and the key/value pairs idea will serve you well. That's the kind of direction down which I am naturally drawn.

I would create tables like this:

Attribute(attribute_id, description, is_listed)    
-- contains values like "colour", "width", "power source", etc. 
-- "is_listed" tells us if we can get a list of valid values: 

AttributeValue(attribute_id, value)
-- lists of valid values for different attributes.  

Product (product_id, description)

ProductAttribute (product_id, attribute_id)  
-- tells us which attributes apply to which products

Order (order_id, etc)

OrderLine (order_id, order_line_id, product_id)

OrderLineProductAttributeValue (order_line_id, attribute_id, value)
-- tells us things like: order line 999 has "colour" of "blue"

The SQL to pull this together is not trivial, but it's not too complex either... and most of it will be write once and keep (either in stored procedures or your data access layer).

We do similar things with a number of types of entity.

AJ
A: 

Chris and AJ: Thanks for your responses. The product line may change, but I would not term it "volatile".

The reason I dislike the third option is that it comes at the cost of metadata for the product attribute values. It essentially turns columns into rows, losing most of the advantages of the database column in the process (data type, default value, constraints, foreign key relationships etc.)

I've actually been involved in a past project where the product definition was done in this way. We essentially created a full product/product attribute definition system (data types, min/max occurrences, default values, 'required' flags, usage scenarios etc.) The system worked, ultimately, but came with a significant cost in overhead and performance (e.g. materialized views to visualize products, custom "smart" components to represent and validate data entry UI for product definition, another "smart" component to represent the product instance's customizable attributes on the order line, blahblahblah).

Again, thanks for your replies!

Jared
+3  A: 

The first solution you describe is the best if you want to maintain data integrity, and if you have relatively few product types and seldom add new product types. This is the design I'd choose in your situation. Reporting is complex only if your reports need the product-specific attributes. If your reports need only the attributes in the common Products table, it's fine.

The second solution you describe is called "Polymorphic Associations" and it's no good. Your "foreign key" isn't a real foreign key, so you can't use a DRI constraint to ensure data integrity. OO polymorphism doesn't have an analog in the relational model.

The third solution you describe, involving storing an attribute name as a string, is a design called "Entity-Attribute-Value" and you can tell this is a painful and expensive solution. There's no way to ensure data integrity, no way to make one attribute NOT NULL, no way to make sure a given product has a certain set of attributes. No way to restrict one attribute against a lookup table. Many types of aggregate queries become impossible to do in SQL, so you have to write lots of application code to do reports. Use the EAV design only if you must, for instance if you have an unlimited number of product types, the list of attributes may be different on every row, and your schema must accommodate new product types frequently, without code or schema changes.

Another solution is "Single-Table Inheritance." This uses an extremely wide table with a column for every attribute of every product. Leave NULLs in columns that are irrelevant to the product on a given row. This effectively means you can't declare an attribute as NOT NULL (unless it's in the group common to all products). Also, most RDBMS products have a limit on the number of columns in a single table, or the overall width in bytes of a row. So you're limited in the number of product types you can represent this way.

Hybrid solutions exist, for instance you can store common attributes normally, in columns, but product-specific attributes in an Entity-Attribute-Value table. Or you could store product-specific attributes in some other structured way, like XML or YAML, in a BLOB column of the Products table. But these hybrid solutions suffer because now some attributes must be fetched in a different way

The ultimate solution for situations like this is to use a semantic data model, using RDF instead of a relational database. This shares some characteristics with EAV but it's much more ambitious. All metadata is stored in the same way as data, so every object is self-describing and you can query the list of attributes for a given product just as you would query data. Special products exist, such as Jena or Sesame, implementing this data model and a special query language that is different than SQL.

Bill Karwin
+2  A: 

There's no magic bullet that you've overlooked.

You have what are sometimes called "disjoint subclasses". There's the superclass (Product) with two subclasses (ProductX) and (ProductY). This is a problem that -- for relational databases -- is Really Hard. [Another hard problem is Bill of Materials. Another hard problem is Graphs of Nodes and Arcs.]

You really want polymorphism, where OrderLine is linked to a subclass of Product, but doesn't know (or care) which specific subclass.

You don't have too many choices for modeling. You've pretty much identified the bad features of each. This is pretty much the whole universe of choices.

  1. Push everything up to the superclass. That's the uni-table approach where you have Product with a discriminator (type="X" and type="Y") and a million columns. The columns of Product are the union of columns in ProductX and ProductY. There will be nulls all over the place because of unused columns.

  2. Push everything down into the subclasses. In this case, you'll need a view which is the union of ProductX and ProductY. That view is what's joined to create a complete order. This is like the first solution, except it's built dynamically and doesn't optimize well.

  3. Join Superclass instance to subclass instance. In this case, the Product table is the intersection of ProductX and ProductY columns. Each Product has a reference to a key either in ProductX or ProductY.

There isn't really a bold new direction. In the relational database world-view, those are the choices.

If, however, you elect to change the way you build application software, you can get out of this trap. If the application is object-oriented, you can do everything with first-class, polymorphic objects. You have to map from the kind-of-clunky relational processing; this happens twice: once when you fetch stuff from the database to create objects and once when you persist objects back to the database.

The advantage is that you can describe your processing succinctly and correctly. As objects, with subclass relationships.

The disadvantage is that your SQL devolves to simplistic bulk fetches, updates and inserts.

This becomes an advantage when the SQL is isolated into an ORM layer and managed as a kind of trivial implementation detail. Java programmers use iBatis (or Hibernate or TopLink or Cocoon), Python programmers use SQLAlchemy or SQLObject. The ORM does the database fetches and saves; your application directly manipulate Orders, Lines and Products.

S.Lott