views: 152
answers: 5

I have a form where users submit different fields to create events. The number and type of fields requested are different on each form, depending on the category of event. What is the best way to go about designing this database - should the events table contain every possible field and simply leave the unused fields NULL? Thanks!

A: 

That is one answer, but if you want to do it well, you can create separate tables for the different types of events, or simply serialize the data into one field and save it there. That would make searching harder, but it is an option.

As an answer I would say: if there will be many NULLs, go with the separate tables; if not, put the many fields in one table.
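
For illustration, a minimal sketch of the serialize-into-one-field option (table and column names here are invented, and the JSON shape is just an example):

CREATE TABLE events (
    id            INTEGER PRIMARY KEY,
    category      VARCHAR(50)  NOT NULL,
    name          VARCHAR(200) NOT NULL,
    extra_fields  TEXT                      -- serialized category-specific data
);

-- easy to write, but you cannot index or filter on the packed values directly
INSERT INTO events (id, category, name, extra_fields)
VALUES (1, 'dinner', 'Annual Gala', '{"plates": 12, "menu": "vegetarian"}');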

Tomasz Kowalczyk
A: 

It depends on how drastically different your forms are. I say have a separate field for each entry item... having one field with multiple items in it will just make queries much more difficult. If your forms aren't too different, then one table with each possible field would be OK, but if your table ends up with 20+ fields in it, then I'd suggest splitting those tables up. I'd also recommend a header table with a "form type" field to help searches as well.
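
A rough sketch of the one-wide-table plus "form type" header idea (all names here are mine, not from the question):

CREATE TABLE form_types (
    form_type_id  INTEGER PRIMARY KEY,
    description   VARCHAR(100) NOT NULL     -- 'dinner', 'concert', ...
);

CREATE TABLE events (
    event_id      INTEGER PRIMARY KEY,
    form_type_id  INTEGER NOT NULL REFERENCES form_types(form_type_id),
    name          VARCHAR(200) NOT NULL,
    plates        INTEGER,                  -- dinner-only, NULL elsewhere
    seats         INTEGER                   -- concert-only, NULL elsewhere
);

-- the form type narrows searches to the fields that actually apply
SELECT event_id, name, plates
FROM events
WHERE form_type_id = 1;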

Aaron
This is clearly the right answer from top to bottom... my answer is to discuss the cons of other choices and how you could do it better.
Stephanie Page
A: 

You should normalize your tables as far as you can to reduce the number of NULLs in the database; every record you store should be meaningful. One method might be to have a table of categories that relates 1->m to a table of events. You could then have a table of the expected fields in the forms (giving each an int id), and an intermediate table would store the actual submitted data.

Category:  catID | Category
Event:     eventID | event | catID          (catID references Category)
Field:     fldID | fldName | eventID        (eventID references Event)
Data:      dataID | fldID | eventID | data  (fldID references Field, eventID references Event)
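
In SQL, those four tables might look something like this (column types are my guesses):

CREATE TABLE category (
    cat_id    INTEGER PRIMARY KEY,
    category  VARCHAR(100) NOT NULL
);

CREATE TABLE event (
    event_id  INTEGER PRIMARY KEY,
    event     VARCHAR(200) NOT NULL,
    cat_id    INTEGER NOT NULL REFERENCES category(cat_id)
);

CREATE TABLE field (
    fld_id    INTEGER PRIMARY KEY,
    fld_name  VARCHAR(100) NOT NULL,
    event_id  INTEGER NOT NULL REFERENCES event(event_id)
);

CREATE TABLE data (
    data_id   INTEGER PRIMARY KEY,
    fld_id    INTEGER NOT NULL REFERENCES field(fld_id),
    event_id  INTEGER NOT NULL REFERENCES event(event_id),
    data      VARCHAR(4000)                 -- the submitted value
);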
Joel Etherton
+1  A: 

I'd think carefully about this abstraction, but you can also have a linked table that contains the event details:

Table Event:  id, Name

Table EventDetail:  id, EventID, DetailFieldName, DetailText

One Event record can have many EventDetail records.

This is flexible, but again there are tradeoffs to consider. Your queries will get more complicated, and presenting the results has an additional layer of indirection (you have to loop through all of the EventDetail records for a particular Event record to present the whole thing).
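
To make that tradeoff concrete, a minimal sketch (column types assumed):

CREATE TABLE Event (
    id    INTEGER PRIMARY KEY,
    Name  VARCHAR(200) NOT NULL
);

CREATE TABLE EventDetail (
    id               INTEGER PRIMARY KEY,
    EventID          INTEGER NOT NULL REFERENCES Event(id),
    DetailFieldName  VARCHAR(100) NOT NULL,
    DetailText       VARCHAR(4000)
);

-- reassembling one event means an extra join and a loop over the detail rows
SELECT e.Name, d.DetailFieldName, d.DetailText
FROM Event e
JOIN EventDetail d ON d.EventID = e.id
WHERE e.id = 1;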

You can go all out and also normalize out the DetailFieldName into an EventDetailField table if you want.

However, you do end up with only a couple of tables, you can get rid of NULLs completely if you want, and you don't have to design a new table for each particular event type.

Choose your poison. ;) Normalizing has its place, but I've also found that it makes certain tasks very difficult if you normalize too much.

John at CashCommons
If he has a lot of events and they have a lot of detail fields, he's going to get a lot of string-based redundancy in the EventDetail table. I'm also for normalizing only so far as you need to, but in this case I don't think he'd lose anything by going the next step, though indexing will be a little easier by breaking it out.
Joel Etherton
Joel: Right, that's true. I've burned myself normalizing every last thing out in the (not too distant) past, so maybe I'm pulling to the other extreme now and not normalizing enough. Having too many levels of normalization makes for really nasty queries. With my example above there's only one JOIN. Then again, normalizing out the type would save space in the long run. Nothing's free. :)
John at CashCommons
+4  A: 

If you begin to consider Joel's advice, please go here.

or here

or here

And if you don't believe any of them, build the 4 tables he mentions. There are only 4; it doesn't take long. Then load some data into them... then try to write the queries you want to write...

Changing Column meaning:

This can really screw with cardinality estimates. Your dinner plates might be in the 4 - 20 range, the concert seats between 1000 - 2000. Some cardinality calculations look at the spread from min to max and assume an equal distribution (when lacking other statistics)...

From 4 to 2000 means that anywhere you say GENERIC_COLUMN = n, the optimizer assumes the rows you'll hit are 1/1996th of the total... but really, if you said WHERE EVNT_TYPE = Dinner AND GENERIC_COLUMN = n, it would REALLY be between 4 and 20, or 1/16th of the total rows... so a huge swing in the cardinality estimate. (This can be fixed with histograms, but the point of showing the automation issues is just to highlight that if it's an issue for a machine, it's probably not as clean as it could be.)

So if you were to do this (MUCH BETTER than an EAV but...)

I would recommend creating a view for each object.

Table EVENT   (common fields, Generic_Count)
View DINNER   (common fields, Generic_Count as Plates) WHERE type = Dinner
View CONCERT  (common fields, Generic_Count as Seats) WHERE type = Concert

Then give NO ONE SELECT against EVENT.
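
Roughly, in SQL (the type column and all names here are my assumptions):

CREATE TABLE event (
    event_id       INTEGER PRIMARY KEY,
    name           VARCHAR(200) NOT NULL,
    event_type     VARCHAR(20)  NOT NULL,   -- 'Dinner', 'Concert', ...
    generic_count  INTEGER
);

-- each view renames the generic column to what it means for that type
CREATE VIEW dinner AS
    SELECT event_id, name, generic_count AS plates
    FROM event
    WHERE event_type = 'Dinner';

CREATE VIEW concert AS
    SELECT event_id, name, generic_count AS seats
    FROM event
    WHERE event_type = 'Concert';

-- applications query the views only; no grants on event itself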

But this is where you get into trouble by NOT starting with a conceptual data model first.

You'd have an ENTITY for EVENT, another for DINNER which inherits completely from EVENT, and another for CONCERT which inherits completely from EVENT. Then you could set a differentiating column in the inheritance object, which lets you set the "TYPE" column, and then you could even decide how many tables to build with the flick of a switch: 1 table, 2 tables, or 3 tables.

At least you can do that in PowerDesigner.

Why is DDL considered so 'bad'?

The creation of EAV models, and questions like this one, are organized around the idea that DDL is to be avoided. Why ALTER TABLE when you can INSERT a new attribute row? People make poor data model design decisions based on the wrong utility function: things like 'no nullable columns', 'the fewer tables the better', 'no DDL just to add a new attribute; insert into the Attribute table instead'.
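
The contrast being criticized, as a toy example (table names invented; exact ALTER syntax varies by RDBMS):

-- the DDL approach: the new attribute becomes a real column
ALTER TABLE event ADD COLUMN seats INTEGER;

-- the "no DDL" EAV approach: the new attribute is just more rows
INSERT INTO attribute (attribute_name) VALUES ('seats');
INSERT INTO event_attribute (event_id, attribute_name, value)
VALUES (1, 'seats', '1500');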

Think of data modeling like this: sculptors will say that the wood or stone already has the figure inside of the block, they are just removing pieces of it to reveal it.

Your problem space already has a data model, it's just your job to discover it... it will have as many tables and columns as it needs. Trying to force it to conform to one of the above utility functions is where things go horribly wrong.

In your case, would you ever like to know all the events you've added in the past 2 weeks? Now think of the possible models. One table per event type would mean summing over n tables to find that answer; with each new event type a new table gets added, and every "all events" query has to change. You could build a UNION ALL view of those tables, but you'd have to remember to add each new table to the view, and debugging through views like that is a pain.
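
For instance, a hypothetical sketch of the two shapes that "last 2 weeks" question takes (names are invented; date arithmetic varies by RDBMS):

-- one table per type: this view must be edited every time a new type appears
CREATE VIEW all_events AS
    SELECT event_id, name, created_at FROM dinner_event
    UNION ALL
    SELECT event_id, name, created_at FROM concert_event
    UNION ALL
    SELECT event_id, name, created_at FROM conference_event;   -- ...and so on

-- one common table: the same question is a single query
SELECT COUNT(*)
FROM event
WHERE created_at >= CURRENT_DATE - 14;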

Assuming that you might want a lot of metrics about ALL events, one table makes more sense, at least for some common portion of your event data: event name, sponsor ID, venue ID, event start time, event end time, venue-available-for-setup time, etc. Those fields are (let's stipulate) common to every event.

So now, what to do with the other columns? Two options: nullable fields, or vertically partition the table. The latter is an optimization of the former. And if you read any database optimization books or blogs, the major thing I take from them is that premature optimization kills. I see people implementing lots of strategies for problems before they even know if they will have that problem. A coworker had a slow query he wanted me to help with. It was loaded with optimizer hints. I removed them and the SQL screamed... I don't know WHY he hinted it, but he was not doing it effectively, and I'm pretty sure he never saw an issue, so it was all just premature optimization.

Vertical partitioning is something you do when you have large data volumes and some of the data is frequently accessed while the rest is not so useful. You can pack a table into a lot fewer blocks if you only pack the common columns. More rows per block = faster table scans... it doesn't really affect the speed of finding a single row via an index. As you can see, vertical partitioning has a specific problem it can solve (others too, like row chaining), so if you're sure that's GOING to be an issue, then by all means begin that way.
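
A sketch of what a vertical partition might look like here (names are mine): the hot, common columns stay in a narrow main table and the rarely used, type-specific columns move to a 1:1 side table.

CREATE TABLE event (
    event_id    INTEGER PRIMARY KEY,
    name        VARCHAR(200) NOT NULL,
    venue_id    INTEGER,
    start_time  TIMESTAMP,
    end_time    TIMESTAMP
);

CREATE TABLE event_dinner_detail (
    event_id  INTEGER PRIMARY KEY REFERENCES event(event_id),
    plates    INTEGER,
    menu      VARCHAR(200)
);

-- most scans touch only the narrow table; details join in only when needed
SELECT e.name, d.plates
FROM event e
JOIN event_dinner_detail d ON d.event_id = e.event_id;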

Stephanie Page
What if I restricted the generic columns to booleans, tinyints within a specific range to represent multiple-choice questions, and short varchar descriptions? Could you foresee any problems with this method?
cph
There's a fundamental problem with this approach. I used one example of how it can go wrong, but it is by no means the only one. Working around this ONE issue doesn't change the fact that RDBMSs don't work well like this, and neither do apps or anything else you want to do. I'm adding more comments above.
Stephanie Page
Even booleans can be a problem (especially if your RDBMS doesn't support multi-column statistics the way Oracle 11g does). Say you have 2 types of thing in a table, and in that bool col 99 foos have a T and 1 has an F, while for the bars there are 99 F and 1 T. It looks like that bool col has a 50/50 chance for any row, but that's not really true, since there is only ONE F foo and ONE T bar.
Stephanie Page
excellent explanation, thanks for your help!
cph
+1 Great answer, and great links except that you didn't include this one: http://tonyandrews.blogspot.com/2004/10/otlt-and-eav-two-big-design-mistakes.html ;-)
Tony Andrews
@Tony, I'm thinking about a meta blog for all the blog posts out there and my own which rail against the EAV.
Stephanie Page