I am trying to weigh up the relative pros and cons of a simple database structure such as this:

1.

CREATE TABLE x (
    my_id INT PRIMARY KEY,
    ...,
    text_attribute_blah TEXT,
    text_attribute_blah_blah TEXT
);

vs:

2.

CREATE TABLE x (
    my_id INT PRIMARY KEY,
    ...
);

CREATE TABLE attributes (
    my_id INT,  /* foreign key to x.my_id */
    text_attribute_type INT,
    text_attribute TEXT
);

Where text_attribute_type could be blah or blah_blah.

Option 1 offers simplicity - the table is easier to read/write; Option 2 offers flexibility (if we want to add another attribute such as blah_blah_blah, we don't need to make schema changes and so probably fewer code changes.)

Is there a right/wrong answer to this conundrum? Is one of these options considered better practice than the other? Can you point me at further reading that might help me determine the way forward?

+8  A: 

I'd almost always choose #1 - I just prefer to have attributes as columns in my tables - makes querying, indexing for performance and the general handling much easier and more transparent.
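To make the querying point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names x1 and x2, the sample values, and the mapping of type 1 to blah and type 2 to blah_blah are assumptions for illustration only; the column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Option 1: attributes as columns (x1 is a hypothetical name)
    CREATE TABLE x1 (
        my_id INTEGER PRIMARY KEY,
        text_attribute_blah TEXT,
        text_attribute_blah_blah TEXT
    );
    -- An ordinary single-column index is enough here.
    CREATE INDEX idx_x1_blah ON x1 (text_attribute_blah);

    -- Option 2: one row per attribute; assumed 1 = blah, 2 = blah_blah
    CREATE TABLE x2 (my_id INTEGER PRIMARY KEY);
    CREATE TABLE attributes (
        my_id INTEGER REFERENCES x2 (my_id),
        text_attribute_type INTEGER,
        text_attribute TEXT
    );

    INSERT INTO x1 VALUES (1, 'red', 'large'), (2, 'red', 'small');
    INSERT INTO x2 VALUES (1), (2);
    INSERT INTO attributes VALUES
        (1, 1, 'red'), (1, 2, 'large'),
        (2, 1, 'red'), (2, 2, 'small');
""")

# Option 1: a plain WHERE clause over two columns.
wide_rows = conn.execute("""
    SELECT my_id FROM x1
    WHERE text_attribute_blah = 'red'
      AND text_attribute_blah_blah = 'large'
""").fetchall()

# Option 2: one self-join of the attributes table per attribute tested.
eav_rows = conn.execute("""
    SELECT a.my_id
    FROM attributes AS a
    JOIN attributes AS b ON b.my_id = a.my_id
    WHERE a.text_attribute_type = 1 AND a.text_attribute = 'red'
      AND b.text_attribute_type = 2 AND b.text_attribute = 'large'
""").fetchall()

print(wide_rows, eav_rows)
```

Both queries find the same row, but the second one grows by another join for every additional attribute in the filter.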

The #2 option is called EAV (Entity Attribute Value), and it has some major drawbacks; see the article "Five simple database design errors you should avoid".

marc_s
I'll add this link: http://www.simple-talk.com/opinion/opinion-pieces/bad-carma/
HLGEM
Stop the EAV insanity!
Tuzo
+2  A: 

Option 1 almost every time. Option 2 is very inefficient. It is also clumsy to query when you need to do anything beyond a simple lookup. Having said that, I have seen a number of products that do this for user-defined attributes. Examples of systems that use the option 2 technique are Agresso and Kalido.

If you're doing a bespoke application the best way by far to add attributes is simply to extend the database schema when you need to. As the change will be accompanied by modifications to the code it can be done as a part of the release process.

If you're doing a packaged application that you intend customers to configure themselves you have three broad approaches that you can take.

  1. EAV structure like option 2. This is flexible, but is inefficient to query, particularly as the queries get complex with multiple joins.

  2. Make a set of 'User' fields (User1, User2 etc.) on the tables. This limits you to a finite number, but this can be quite large (you could have User01-User99 if you wanted). However, it is the most efficient and simplest to query. The other con is that the fields are somewhat opaque. You have to have access to configuration information to know the meaning of 'User3'. It also sacrifices some type safety. On balance, however, your user field mechanism is going to have some of its own metadata and a generic framework of some sort, so some of that type safety can be provided through this.

    This looks the most inelegant but is the best way to do this in most cases as it has the best performance and simplest queries. It is by far the easiest scheme to work with.

  3. XML. This is infinitely flexible but most of the tooling surrounding the database does a poor job of working with XML. It also stores the XML in separate allocation units from the main table, so it can cause significant issues with query performance. XML based strategies are very application-centric at the expense of other consumers of the data.

    In my experience storing significant amounts of data in XML fields in a database will significantly increase your application's TCO. Not recommended for user data fields in most cases.
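A rough sketch of approach 2 (generic 'User' fields plus metadata), using Python's sqlite3. The user_field_config table, its columns, and the labels are hypothetical; they only stand in for whatever configuration mechanism a real product would have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE x (
        my_id INTEGER PRIMARY KEY,
        User1 TEXT,
        User2 TEXT,
        User3 TEXT
    );
    -- Hypothetical configuration table: what each generic column means.
    CREATE TABLE user_field_config (
        column_name TEXT PRIMARY KEY,
        label TEXT NOT NULL,
        data_type TEXT NOT NULL
    );
    INSERT INTO user_field_config VALUES
        ('User1', 'Cost centre', 'text'),
        ('User3', 'Review date', 'date');
    INSERT INTO x VALUES (1, 'CC-100', NULL, '2010-06-30');
""")

def labelled_row(conn, my_id):
    """Return {label: value} for the configured user fields of one row."""
    config = conn.execute(
        "SELECT column_name, label FROM user_field_config ORDER BY column_name"
    ).fetchall()
    # Column names come from our own config table, never from end-user input.
    columns = ", ".join(name for name, _ in config)
    row = conn.execute(
        f"SELECT {columns} FROM x WHERE my_id = ?", (my_id,)
    ).fetchone()
    return {label: value for (_, label), value in zip(config, row)}

result = labelled_row(conn, 1)
print(result)
```

The query against x stays a flat single-table SELECT; the opacity of 'User3' is handled by the lookup in the configuration table.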

ConcernedOfTunbridgeWells
+4  A: 

It's interesting that you don't mention either performance or data integrity as concerns. For what it's worth, model #1 is the best approach for those considerations.
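As an illustration of the data integrity point, here is a small sqlite3 sketch. The table name x1 and the rule that blah must be one of three colours are invented for the example. With a column the rule sits on the column itself; the generic value column in model #2 has no per-attribute rule unless extra machinery is added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Model #1: the (hypothetical) rule is declared on the column.
    CREATE TABLE x1 (
        my_id INTEGER PRIMARY KEY,
        text_attribute_blah TEXT NOT NULL
            CHECK (text_attribute_blah IN ('red', 'green', 'blue'))
    );
    -- Model #2: one generic value column shared by every attribute.
    CREATE TABLE attributes (
        my_id INTEGER,
        text_attribute_type INTEGER,
        text_attribute TEXT
    );
""")

# The column-based table rejects a value that breaks the rule.
try:
    conn.execute("INSERT INTO x1 VALUES (1, 'purple')")
    column_rejected = False
except sqlite3.IntegrityError:
    column_rejected = True

# The same value goes into the EAV table without complaint.
conn.execute("INSERT INTO attributes VALUES (1, 1, 'purple')")
eav_count = conn.execute("SELECT COUNT(*) FROM attributes").fetchone()[0]

print(column_rejected, eav_count)
```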

Flexibility is vastly over-rated with regards to data models. Most table structures are well-known at the start of development and remain stable throughout the lifetime of a database. If you have an application where the model is genuinely fluid and unknowable then probably you should not be using an RDBMS at all. Choose one of the NoSQL products instead.

So that's another vote for #1.

APC
Since performance and data integrity are the two most critical elements of database design (security being the third) you get a +1 from me.
HLGEM
+1  A: 

@marc_s I don't believe one can "almost always" make any one selection among the above options. There is a case for each of the solutions.

Option #1: Go for this when the entity X is well defined, i.e. you know exactly what you need to capture in order to define X. In such a case one single record of X pretty much captures everything an instance of X stands for.

Option #2: Go for this when the entity X cannot be completely defined, i.e. you don't know what set of attributes is required to define it "completely".

For example, take the employee record mentioned in the article "Five simple database design errors you should avoid" [link provided by @marc_s]. Yes, you will be tempted to go for Option 1. But consider employees working in large organizations: the employee information, both its definition and its content, is highly dynamic, so one single record cannot capture it and a combination of Option #1 and Option #2 is required.

shreeneewas
I still believe that in more than 90% of the cases, I cannot see any good reason for option #2, considering all the negatives it has (data integrity, performance, clumsy querying)... if you don't need a particular attribute - make it nullable. If you have blocks of attributes for certain employees - put them in a separate FK-linked table - I have yet to find a really compelling reason for an EAV...
marc_s
My answer to your comment is posted as the next answer.
shreeneewas
+3  A: 

Every solution has a problem to solve. #1 will be a good approach if you know the columns that you need upfront. However, in some cases, the columns are not known upfront. For example, custom fields that a user adds to a functionality.

Having said that, EAVs have an abundance of problems. When used properly, IMO, they are useful.

  1. Make sure you do not create an EAV for everything. It is only for "unknown items".
  2. Remember that EAVs do not have foreign-key relationships to depend on.
  3. Performance is low because of non-trivial queries, and the maintenance effort may be higher.
  4. Keep in mind that EAVs have to be pivoted to make them meaningful (well, most often).
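For point 4, one common way to pivot is conditional aggregation. This is only a sketch with Python's sqlite3; the sample rows and the assumption that type 1 means blah and type 2 means blah_blah are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE attributes (
        my_id INTEGER,
        text_attribute_type INTEGER,  -- assumed: 1 = blah, 2 = blah_blah
        text_attribute TEXT
    );
    INSERT INTO attributes VALUES
        (1, 1, 'red'), (1, 2, 'large'),
        (2, 1, 'green');
""")

# Turn one-row-per-attribute into one-row-per-entity.
# MAX ignores NULLs, so each CASE picks out the single matching value.
pivoted = conn.execute("""
    SELECT my_id,
           MAX(CASE WHEN text_attribute_type = 1 THEN text_attribute END) AS blah,
           MAX(CASE WHEN text_attribute_type = 2 THEN text_attribute END) AS blah_blah
    FROM attributes
    GROUP BY my_id
    ORDER BY my_id
""").fetchall()

print(pivoted)
```

Entity 2 has no blah_blah row, so that column comes back as NULL. Note that the query has to name every attribute type, so it needs changing whenever a new type should appear as a column.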
Josh
A: 

As was said earlier, it depends on your requirements. You should choose #2 only if you need, for example, to add new types of attributes as part of your program workflow. Doing this by adding new columns to your tables is certainly worse than having one extra table and one extra join in your queries.
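A minimal sketch of that workflow case, using Python's sqlite3. The attribute_types lookup table and the add_attribute_type helper are hypothetical additions, not part of the schema in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE x (my_id INTEGER PRIMARY KEY);
    -- Hypothetical lookup table naming each attribute type.
    CREATE TABLE attribute_types (
        text_attribute_type INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE attributes (
        my_id INTEGER REFERENCES x (my_id),
        text_attribute_type INTEGER
            REFERENCES attribute_types (text_attribute_type),
        text_attribute TEXT
    );
    INSERT INTO x VALUES (1);
    INSERT INTO attribute_types (name) VALUES ('blah'), ('blah_blah');
""")

def add_attribute_type(conn, name):
    """Register a new attribute type at run time; no DDL is involved."""
    cur = conn.execute(
        "INSERT INTO attribute_types (name) VALUES (?)", (name,)
    )
    return cur.lastrowid

new_type = add_attribute_type(conn, "blah_blah_blah")
conn.execute(
    "INSERT INTO attributes VALUES (?, ?, ?)", (1, new_type, "some value")
)

# The column-based equivalent would change the schema itself:
#   ALTER TABLE x ADD COLUMN text_attribute_blah_blah_blah TEXT;
stored = conn.execute("""
    SELECT t.name, a.text_attribute
    FROM attributes AS a
    JOIN attribute_types AS t USING (text_attribute_type)
""").fetchall()

print(stored)
```

Here a new attribute type is just data, so ordinary application code can create it, at the cost of the one extra join mentioned above.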

Haspemulator
+1  A: 

@marc_s

Though I mentioned the example of the employee record, I am sure it is not very convincing.

Here is an example from the financial domain.

If you want to capture all the attributes of a deal, what you need depends on its type of instrument. It is a lot easier to capture most Forex, Money Market and even Bond instruments, as they are very structured. But as we move towards derivative products it becomes very cumbersome. They are very exotic in nature and keep changing in structure (and hence in meaning). To capture such dynamically changing information we should opt for EAV. Of course, while making this choice one should be aware that it brings a lot of the negatives listed above in your comment.

I cannot talk about other domains, but I am sure you will find that IT systems in a lot of business domains face this situation, and hence having a good understanding of the EAV strategy, as opposed to rejecting it outright, will be a good idea.

Shrini

shreeneewas