views:

637

answers:

9

Let's say you're collecting insider info on upcoming superhero movie releases and your main Movie table looks something like this:

Table 1

Title              Director   Leading Male      Leading Female    Villain
--------------------------------------------------------------------------
Green Lantern      Kubrick    Robert Redford     Miley Cyrus     Hugh Grant  
The Tick          Mel Gibson  Kevin Sorbo        Linda Hunt    Anthony Hopkins

This should work very well in general and allow very easy queries as well as comparisons between rows.

However, you'd like to track the source of each data fact, as well as the name of the journalist who discovered the fact. This seems to suggest some sort of an EAV table like this:

Table 2

Movie             Attribute            Value          Source          Journalist
----------------------------------------------------------------------------------
Green Lantern      Director           Kubrick         CHUD              Sarah
Green Lantern    Leading Male      Robert Redford     CHUD              James
Green Lantern   Leading Female      Miley Cyrus    Dark Horizons        James
Green Lantern      Villain           Hugh Grant       CHUD              Sarah
The Tick           Director          Mel Gibson       Yahoo            Cameron
...

Which, while it easily captures the meta-data that we wanted, makes queries harder. It takes a bit more to simply get all the basic data of a single movie. More specifically, you have to deal with four rows here to get the four important tidbits of information on the Green Lantern while in table 1 it is a single, nicely encapsulated row.

So my question is, in light of the complications I just described, and because I know in general EAV tables are to be avoided, is the EAV still the best solution? It does seems like it is the only reasonable way to represent this data. The only other alternative I see is to use table 1 in conjunction with another one that only houses meta data like this:

Table 3

Movie             Attribute            Source          Journalist
----------------------------------------------------------------------------------
Green Lantern      Director             CHUD              Sarah
Green Lantern    Leading Male           CHUD              James
Green Lantern   Leading Female      Dark Horizons         James
Green Lantern      Villain              CHUD              Sarah
The Tick           Director             Yahoo            Cameron
...

But this is very dangerous because if someone changes a column name in table 1, like "Villain" to "Primary Villain," the row in table 3 will still simply say "Villain" and thus the related data will be unfortunately decoupled. This could be helped if the "attribute" column was linked to another table that served as an enumeration of the columns of table 1. Of course, the DBA would be responsible for maintaining this enumeration table to match the actual columns of table 1. And it might actually be possible to improve this even further by instead of creating the enumeration table by hand, use a system view in SQL Server that houses the names of the columns in table 1. Though I'm not sure you can have relationships that involve system views.

What do you suggest? Is the EAV the only way to go?

And what if it was only one meta-data column (just "Source" without "Journalist") - is it still necessary to go the EAV route? You could have columns "Director," "Director_Source," "Leading Male," "Leading Male_Source," etc., but that gets ugly very quickly. Is there some better solution I'm not thinking of?

If I haven't clarified any point please comment and I'll add more as necessary. Oh yeah, and the movie data I used is fabricated :)

Edit: To restate my primary question concisely, I would like to have the simplicity and the true RDBMS design of table 1, which really describes a movie entry well, while still storing the meta data on the attributes in a safe and accessible manner. Is this possible? Or is EAV the only way?

Edit 2: After doing some more web research, I have yet to find a discussion on EAV's that centered around the desire to store metadata on the columns. The primary reason given to implement an EAV is almost always dynamic and unpredictable columns, which is not the case in my example. In my example, There are always the same four columns: director, leading male, leading female, villain. However, I want to store certain facts (source and journalist) about each column for each row. An EAV would facilitate this, but I would like to avoid resorting to that.

Update

Using the Table 2 design except for renaming the column "Movie" to "Name" and calling the whole table "Movie," here is the pivot operation in SQL Server 2008 to get back Table 1:

SELECT Name, [Director], [Leading Male], [Leading Female], [Villain]
FROM (Select Name, Attribute, Value FROM Movie) as src
PIVOT
(
Max(Value)
FOR Attribute IN ([Director], [Leading Male], [Leading Female], [Villain])
)  AS PivotTable
A: 

My response may seem a bit too philosophical for SO. Bear with me.

I think that the "Source" column isn't subject matter data, but rather meta-data. It's really data about how we come to know some other bit of data. That makes it data about data, and that's meta-data.

Among the reasons why EAV causes the problems that it does is the fact that it intermixes data and metadata in a single row. There are times when I've deliberately done that myself, as an intermediate step towards a result I want to acheive. But I've tried never to intermix data and metadata in my deliverables.

I know why I never did that, but I can't explain it concisely.

Walter Mitty
A: 

Since no one else is really taking a crack at it, I'm going to answer my own question. I'm pretty sure an EAV-like table is indeed the only way to go. To store metadata on each column (regarding the source and journalist in this case), you're really treating each column as an entity in itself, which is what an EAV allows.

You could go other routes, like adding a second and third column for each original column to store data, but that is definitely breaking some fundamental normalization rules and will probably only cause you pain later.

JoeCool
A: 

Hmm.... I've not used this, so I'm not speaking from experience (i.e. don't blame me if it doesn't work), but on the surface it seems that you could store "common" data that you know will always be there as you would in a normal table, and "metadata" that might change as XML. The question then is how to query it nicely, and I think you might be able to do that as described HERE.

Michael Bray
+1  A: 

Interesting scenario. You could get around the EAV ghetto-ness by thinking about your entities as first class objects; let's call them Facts. And it helps that you're pretty orthogonal in this case, in that every movie has the exact same four facts. Your EAV table can be your pristine/correct table, and then you can have an outside process that mines that table and replicates the data into a properly normalized form (i.e. your first table). This way you have the data you want, with its meta data, and, you have an easy way to query for movie information, accurate to how often your mining process runs.

I think you definitely need some "out-of-database" muscle to make sure the data remains valid, since there doesn't seem to be any in-database way of maintaining integrity across your regular and EAV tables. I guess with a complex series of triggers you can pretty much accomplish anything, but one human administrator who "gets" your problem is probably much easier to handle.

Mark Canlas
+1  A: 

Here's another idea...feel free to punch holes in it :)

Table: Movie
Columns: MovieId|Movie|Director|LeadMale|LeadFemale|Villain

Table: MovieSource
Columns: MovieSourceId|MovieId|MovieRoleId|Source|Journalist

Table: MovieRole
Columns: MovieRoleId|MovieRole
Values: 1|Director, 2|LeadMale, 3|LeadFemale, 4|Villain

What I'm thinking is that the columns in the movie table could be of different types (in your example, they are all strings/varchars, but they could be, say, numerical or date information that also has a source).

The column types for the source data, however, probably wouldn't vary as a function of the column types of the movie data, so you could use more of an EAV system for the source without losing the integrity of your data.

The MovieRole table allows you to explicitly enumerate the roles so you can create a sure linkage between the source and a given cell of the movie table.

-Dan

DanM
Additional thoughts: If you rename the "Villain" column "PrimaryVillain", it won't harm the linkages (because they are linked by id not by name). If you add a column to the Movie table, this would require adding a record to MovieRole, but it wouldn't break any existing data (and it would be noticed fairly quickly I would think). If you remove a column from the Movie table, you would of course lose the data for that column, but the sources for that data wouldn't be deleted unless and until you remove the MovieRole (and cascade the deletion).
DanM
I like this solution and it's been one I had been thinking about even before you posted. You could genericize your approach if you call "MovieRole" simply "MovieAttributes," and now the table is simply an enumeration of all the columns (except identifying ones) of the Movie table.
JoeCool
+1  A: 

Seeing as you only have the two fields for source data (Source and Journalist), I would recommend a meta-data table like this:

Movie    DirectorSource  DirectorJournalist  LeadingMaleSource  LeadingMaleJournalist ...
---------------------------------------------------------------------------------------
The Tick   Yahoo           Cameron           ...                ...

This will keep the less important source data out of the main table, but the queries will not get complicated and your code will be more readable.

I would only advise EAV if ...

  • You have more than 3 fields of source meta-data
  • You need to be able to add or change movie fields easily. (changes like 'Villain' to 'Primary Villain' are being done several times per day)
too much php
A: 

Another approach to consider is Class Table Inheritance. Bill Karwin has a great review of EAV options in this SO answer, and lots of good context.

Antony
I appreciate the link, but I'm not sure how I could implement Class Table inheritance here. The example in the question you linked has to do with dealing with common attributes vs. specific attributes and where to house each, and class table inheritance is one answer. However, my question deals with common attributes vs. attributes describing those attributes (metadata). Because of this, and because of reasons I stated in my original question, I think my situation is unfortunately quite a different beast.
JoeCool
Sorry - should have left a longer comment the first time. I'm thinking that the inheritance table would handle the meta data. eg t_MetaData has an FK back to the main table, a metaType column (leading male, leading female, etc), a source and a journalist column.
Antony
Yes, you could do it that way. I think what you're describing is my "Table 3" approach, but that approach introduces a new headache. The metaType column's values need to be constrained to the names of the descriptor columns in Movie. So now you have to make sure that, if there's any change in the names or number of descriptor columns in Movie, this change is propagated to the rows in the t_MetaData. You can ensure this effect with a bit of code, but it leaves an uneasy feeling in my stomach :)
JoeCool
A: 

I would make my decision based on what I need to code.

If src/journo is simply additional info, I would go for further columns. But if I know I'm going to end up building complicated src/journo queries, I would go EAV, as it'll be easier to search for a journalist's references down the meta table than having to go into LeadingFemaleJournalist and VillainJournalist etc.

Personally - I would be inclined dump the src/journo meta-data into another table EAV-style, but use a FK to define an Attribute definition table. Having a freeform Attribute text field is a recipe for disaster - always control your attributes through a constraint. Triggers could be implemented to improve referential integrity if required.

For me, it comes down to point-of-view. Do you see sources and journalists being relational concerns in their own right or are they just additional pieces of data to complement a Movie? The next level of refinement would be to create different tables for MovieDataSource and MovieDataJournalist which could allow you to map FKs to a tables defining valid Sources and Journalists (and further information on those Sources/Journalists could then be fleshed out). What you will have done here is to establish a many-to-many relationship between the Movie entity and the Source (and also Journalist) entity.

Joel Goodwin
+3  A: 

Your can change what you consider a fact value in your design ... it seems that a fact in your data model could be expressed as the following N-tuple:

Movie | FactType | FactValue | FactSource | FactJournalist

The following table structures should support the data model you want, and can relatively easily be indexed and joined. You can also create a view that pivots out just the fact value and fact type so that you can create the following perspective:

MovieID | Movie Name | Director | LeadingMale | LeadingFemale | PrimaryVillain | etc

Interestingly, you could consider this to be the logical extension of fully applying an EAV model to the data, and decomposing an individual movie (with it's intuitive attribution of director, lead, villain, etc) into a pivoted structure where attributes focus on the source of the information instead.

The benefits of the proposed data model are:

  • it is well-normalized (though you should probably normalize the FactType field into a reference table for completeness)
  • it is possible to create a view that pivots fact types efficiently out into a tabular structure
  • it is relatively extensible and allows the database to enforce referential integrity and (if desired) cardinality constraints
  • the MovieFact table can be subclassed to support different kinds of movie facts, not just those that are simple text field
  • simple queries against the data are relatively efficient

Some of the disadvantages of the data model are:

  • Composite, conditional queries are harder (but not impossible) to write (e.g. find all movies where Director is A and Leading Male is B, etc...)
  • The model is somewhat less obvious than the more traditional approach, or one involving EAV structures
  • inserts and updates are a little trickier because updating multiple facts requires updating multiple rows, not multiple columns

I've the Movie data up a level to normalize the structure, and you could pushed the movie name down into the MovieFact structure for consistency (since for some movies I can imagine even then name is something you may want to track source information for).

Table Movie
========================
MovieID   NUMBER, PrimaryKey
MovieName VARCHAR

Table MovieFact
========================
MovieID          NUMBER,  PrimaryKeyCol1
FactType         VARCHAR, PrimaryKeyCol2
FactValue        VARCHAR
FactSource       VARCHAR
FactJournalist   VARCHAR

Your fictional movie data would then look like the following:

Movie Table
====================================================================================
MovieID  MovieName
====================================================================================
1        Green Lantern
2        The Tick

MovieFact Table
====================================================================================
MovieID  FactType       FactValue         FactSource       FactJournalist
====================================================================================
1        Director       Kubrick           CHUD             Sarah
1        Leading Male   Robert Redford    CHUD             James
1        Leading Female Miley Cyrus       Dark Horizons    James
1        Villain        Hugh Grant        CHUD             Sarah
2        Director       Mel Gibson        Yahoo            Cameron
2        Leading Male   John Lambert      Yahoo            Erica
...
LBushkin
+1 - exactly what I was thinking. Much better not to compromise on normalisation in this situation. The part worth stressing is that views can make this easy to query against. Plus, when column names have to change, new views can be created meaning you don't break your existing interface.
Chris Simpson
Thank you for the very in-depth response. This is indeed the form my company has decided to go with (at least, until they change their mind haha). We decided on it because, in the end, we are treating a fact about a movie as a first-class entity because of the journalist and source, so we should model it that way in the database. It hurts a bit because of losing the more obvious representation of a movie (though like you said, we can pivot it), but it simply works best this way.
JoeCool