The major shortcomings of Entity-Attribute-Value database designs in SQL all seem to be related to querying and reporting on the data efficiently and quickly. Most of the information I've read on the subject warns against implementing EAV because of these problems and because querying/reporting is a requirement of almost all applications.

I am currently designing a system where the fields for one of the entities are not known at design/compile time; they are defined by the end user of the system. EAV seems like a good fit for this requirement, but because of the problems I've read about, I am hesitant to implement it, as the system also has some pretty heavy reporting requirements. I think I've come up with a way around this but would like to pose the question to the SO community.

Given that a typical normalized (OLTP) database still isn't always the best option for running reports, a good practice seems to be having a separate "reporting" (OLAP) database into which the data from the normalized database is copied, indexed extensively, and possibly denormalized for easier querying. Could the same idea be used to work around the shortcomings of an EAV design?

The main downside I see is the increased complexity of transferring the data from the EAV database to the reporting database, since you may end up having to alter the tables in the reporting database as new fields are defined in the EAV database. But that is far from impossible and seems an acceptable tradeoff for the flexibility the EAV design provides. This downside would also exist if I used a non-SQL data store (e.g. CouchDB or similar) for the main data storage, since all the standard reporting tools expect a SQL backend to query against.
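For illustration, here is a minimal sketch of what that schema-evolution step in the transfer might look like. The table and column names (Entity_Value, Entity_Report, WorkPhone) are hypothetical, the syntax is PostgreSQL-flavoured, and the "latest wins" rule is just a placeholder:

-- A new user-defined field ("WorkPhone") has appeared in the EAV store, so the
-- load job adds a matching column to the reporting table and backfills it with
-- the most recent value per entity.
ALTER TABLE Entity_Report ADD COLUMN WorkPhone varchar(100);

UPDATE Entity_Report r
SET WorkPhone = (
    SELECT v.Value
    FROM Entity_Value v
    WHERE v.EntityId = r.Id
      AND v.Field = 'WorkPhone'
    ORDER BY v.EffectiveDate DESC  -- naive "latest wins" rule, purely illustrative
    LIMIT 1
);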

Do the issues with EAV systems mostly go away if you have a separate reporting database for querying?

EDIT: Thanks for the comments so far. One of the important things about the system I'm working on is that I'm really only talking about using EAV for one of the entities, not for everything in the system.

The whole gist of the system is to be able to pull data from multiple disparate sources that are not known ahead of time and crunch that data to come up with some "best known" data about a particular entity. So every "field" I'm dealing with is multi-valued, and I'm also required to track history for each. The normalized design for this ends up being one table per field, which makes querying it kind of painful anyway.

Here are the table schemas and sample data I'm looking at (obviously changed from what I'm working on, but I think it illustrates the point well):

EAV Tables

Person
-------------------
-  Id - Name      -
-------------------
- 123 - Joe Smith -
-------------------

Person_Value
-------------------------------------------------------------------
- PersonId - Source - Field       - Value         - EffectiveDate -
-------------------------------------------------------------------
-      123 -    CIA - HomeAddress - 123 Cherry Ln -    2010-03-26 -
-      123 -    DMV - HomeAddress - 561 Stoney Rd -    2010-02-15 -
-      123 -    FBI - HomeAddress - 676 Lancas Dr -    2010-03-01 -
-------------------------------------------------------------------

Reporting Table

Person_Denormalized
----------------------------------------------------------------------------------------
-  Id - Name      - HomeAddress   - HomeAddress_Confidence - HomeAddress_EffectiveDate - 
----------------------------------------------------------------------------------------
- 123 - Joe Smith - 123 Cherry Ln -                  0.713 -                2010-03-26 -
----------------------------------------------------------------------------------------

Normalized Design

Person
-------------------
-  Id - Name      -
-------------------
- 123 - Joe Smith -
-------------------

Person_HomeAddress
------------------------------------------------------
- PersonId - Source - Value         - Effective Date - 
------------------------------------------------------
-      123 -    CIA - 123 Cherry Ln -     2010-03-26 -
-      123 -    DMV - 561 Stoney Rd -     2010-02-15 -
-      123 -    FBI - 676 Lancas Dr -     2010-03-01 -
------------------------------------------------------

The "Confidence" field here is generated using logic that cannot be expressed easily (if at all) using SQL so my most common operation besides inserting new values will be pulling ALL data about a person for all fields so I can generate the record for the reporting table. This is actually easier in the EAV model as I can do a single query. In the normalized design, I end up having to do 1 query per field to avoid a massive cartesian product from joining them all together.

+2  A: 

The problem with EAV is not due to EAV as such. It's due to designing and building a database without understanding what the data requirements really are, and what logical structure the data must have in order to meet these requirements. EAV, and any other system that lets the users design their own data, turns this on its head.

In this scheme, first we come up with a system that lets users store any kind of data whatsoever, regardless of its structure, and regardless of the future intended use. Then, when it's time to get the reports out, we have to figure out what we've got, and how that relates to what we need.

Good luck with that.

Walter Mitty
+3  A: 

@Walter (sorry for not commenting, but there simply isn't any comment button for me to push):

"In this scheme, first we come up with a system that lets users store any kind of data whatsoever, regardless of its structure, and regardless of the future intended use. Then, when it's time to get the reports out, we have to figure out what we've got, and how that relates to what we need."

Since you clearly attribute the nature of the problem to "being in this scheme", it really seems to me as if the problem with EAV really *IS* due to EAV as such.

In fact, come to think of it: "a system that lets users store any kind of data whatsoever" is the equivalent of a system that allows users to just define their relvars. But what portion of that system allows the users to define their constraints? Oooooooooops, the EAV crowd seems to have missed a not-so-unimportant aspect of data management, it seems ...

Erwin Smout
+1 (You need 50 rep to comment!)
Martin Smith
You make some good points, and you might be right about EAV. Can one user declare constraints that constrain another user? If the answer is yes, then "users define their own data" collapses after a while. If the answer is no, then making sense out of the data from the entire user community involves integrating things the users have not integrated. I see that as a problem whether you use EAV or relvars or something else.
Walter Mitty
A: 

Short answer - yes, a reporting database is a reasonable approach to solving the problems of reporting from an EAV data model.

I spent a number of years working with an information management solution that allowed end users complete freedom to define their own data model, with both the schema and the data stored using an EAV model. Interestingly, this product provided meta-schema objects used to fulfill reporting requirements (e.g. graphs to provide object navigation, views to perform projection, etc.). This meant that the end user was free to define queries using the same terms and concepts they'd used to build the data model in the first place. The act of reporting was essentially to compute the data set by navigating these definitions and hand the result over to a traditional report-writing tool as if it were relational data.

One of the strengths of this approach was that the same mechanism that was already in place to transform the EAV model to something the user could work with could be reused and applied to the reporting function.

Pat