views:

492

answers:

7

To adhere to 1st normal form, one of the things you must avoid is repeating groups. As in instead of:

    CustID  Name  Address       Phone1      Phone2       Phone3

     102    Jerry  234 East..   555-2342   555-9854     555-2986

You should create a second Telephone Number table and then on a join you would get:

CustID  Name     Address       Phone

102 Jerry    234 East..   555-2342
102 Jerry    234 East..   555-9854
102 Jerry    234 East..   555-2986

Sometimes, it's a little more ambiguous and it's hard to tell when a group of column headers qualify. For instance, let's say you have, at the moment, two tests you run on every piece of hardware. And your first DB design yields the most horizontal approach:

Design 1

SN     Test1_Max   Test1_Min    Test1_Mean  Test2_Max   Test2_Min    Test2_Mean
2093      23          2            15         54          -24           45

Obviously, this is a repeating group, that could much more easily be represented as (on a join between "Parts" and "Tests"):

Design 2

SN     Test      Max    Min    Mean     
2093    1        23     2      15       
2093    2        54     -24     45

However, you could go even more vertical:

Design 3

SN     Test    Statistic    Value
2093    1        Max          23
2093    1        Min          2
2093    1        Mean         15       
2093    2        Max          54
2093    2        Min         -24
2093    2        Mean         45

Is Design 3 necessary? How do you decide how vertical to make it? What are the pros and cons between Design 2 and 3? It seems that both could be selected or joined easily with SQL, with the advantage given to Design 3 because you could easily add a new Statistic without actually modifying the table structure.

But before anyone goes and says that the more vertical the better, there are times where it's more ambiguous. Like:

Design 4

SN      AverageCurrent (mA)    BatteryCapacity (mA)  
2093          200                    540

Could instead be:

Design 5

SN      mA_Measuremnt       Value
2093    AverageCurrent      200 
2093    BatteryCapacity     540

While both attributes are of the same domain (mA), they represent very different things in regards to the component. In this case, is Design 4 better since it's not strictly a repeating group? I guess what I'm looking for is some criteria to knowing when to break it down into more tables and thus make it more vertical.

To sum up this ridiculously long question, should you only remove and normalize repeating groups if they are exacly the same domain and have the exact same meaning?. If that is the case, then really only the telephone example and probably the two tests in Design 1 meet this criteria. Though it seems like there might be design benefits to Design 3 and 5, even though the statistics of Design 3 have different meanings strictly speaking, and AverageCurrent and BatteryCapacity definitely have different meanings in Design 5.

A: 

When you are sure a 'test' will (ever) only have a Max, Min, and Mean -> use design 2. However, if it is possible that there will be a new 'statistic' in the future, it is better to use design 3.

An answer to:

should you only remove and normalize repeating groups if they are exacly the same domain and have the exact same meaning?

Although in many books, it looks like these normal forms are strictly defined, they are not. You should see for your own application what the best solution is... Normalizing too much is not always the best solution, especially when you see that you always join all the data back together.

Fortega
+4  A: 

Design 2 and Design 4 are the best ways to go provided the results will not always be present (aka NULLs in Desigin 1). If they always are taken, then the first design is fine.

I believe repeating groups in SQL would actually be if you have a column stuffed with add'l values e.g. Phone_Number contains "123-444-4444,123-333-3334" etc.

Anyway, the later designs are suboptimal -- you continue to take that to the final level and have the "One True Lookup Table" http://www.dbazine.com/ofinterest/oi-articles/celko22 or Entity Attribute Value http://tonyandrews.blogspot.com/2004/10/otlt-and-eav-two-big-design-mistakes.html

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:10678084117056

Either way, it's almost always a bad thing. Although they may share a common datatype/domain, the meaning differs -- thus they should remain individual attributes (maxtemp, mintemp, etc.)

Matt Rogish
For those temperature things, I'm not so shure. You should not forget expansibility. If, at some time, you will have a new statistic, you'll have to modify the table of design 2
Fortega
Right, but there's not enough detail in his question to say whether or not this will happen. Sometimes, YAGNI... Sometimes, you might need it. :D
Matt Rogish
+1  A: 

I think of (and was taught) 1NF as 'all rows should be the same length' rather than 'no repeating groups'. With that view you can make a decision slightly more easily from the following:

In design 1, are both tests ALWAYS present? If so, then it isn't truly a repeating group. Are all the averages always present in design 2? Could there be more (or less) in a given row?

In design 4, are both those values always present? If so, it's fine. If not then design 5 should be used.

workmad3
A: 

I suggest to move repeating groups only to separate tables if they have a variable length. If you will ever have only Phone1, Phone2, and Phone3, there is no need to separate them. In the other case, if the number of repeatitons varies, the better design is a separate table.

And your concept of exact same domain and meaning is not very intuitive because it depends on the level of abstraction. Phone1 is not exactly the same as Phone2, but they are both phone numbers. You could also create a table AddressDetails and move the phone numbers there. But also the name, street, and city - they are all address details. You have to find a way between generic key value pairs and only dedicated columns.

Daniel Brückner
A: 

Design 1 is actually in 1NF if you have a PK on CustID. It might be in 3NF if no data is dependent on anything but the PK e.g. Phone1 is not repeated for other CustID.

You can't decide on the model without the business cases you are trying to solve. So Design 1 might be perfectly valid logical model.

wqw
+1  A: 

Here's the rule on repeating groups -- what is functionally dependent?

If the statistic value is functionally dependent on SN, Test and Statistic Name, then you have three key elements and one value element. ( SN, Test, Statistic -> Value )

In this specific case -- aggregated data (mean, sum, min, max) -- you have ambiguity because you're not dealing with atomic objects, you're dealing with aggregates. Strictly speaking, you shouldn't store aggregates, you should compute them. (Yes, I know it's impractical, but that's the relational theory.)

For other cases, it's usually obvious what's a key and what's a value for repeating groups. In this case, however, you're at the murky edge because your storing derivable data.

For your examples, follow the data warehouse design to locate a more pragmatic test:

Would you Slice and Dice by the other key?

Think of your statistical fact as a point surrounded by three dimensions: (SN, Test, Statistic). Is this valid? (With summary data, it's often murky.)

Instead, let's look at the detail data we should have kept: SN, Test, Score. There are clearly two dimensions (SN, Test) and one measure (score) at the intersection of those two dimensions. We can derive any number of statistics from this detailed data using either dimension (SN or Test)

For the battery example, you probably do want to create it as an EAV database instead of a more typical relational database. Your measurements (AvergaeCurrent and BatteryCapacity) give you good reasons to use an Entity-Attribute-Value database design.

Note that ALL relational design is a tension between longer relations and EAV triples. You must always balance the "is this is key" vs. "is this a column" because you can always label everything as a attribute key and use an EAV design.

S.Lott
Good example of why it's tough to diagnose DB design -- we don't know all of the business reasons :D
Matt Rogish
A: 

The design should be determined by your use-case scenarios and the type of queries you anticipate. Are you going to do a lot of reads, writes or a lot of updates? Do you wish to get the entire test data for a candidate or do you wish to get only the best test or something. What query are you going to be running most frequently?

Design 1

SN     Test1_Max   Test1_Min    Test1_Mean  Test2_Max   Test2_Min    Test2_Mean
2093      23          2            15         54          -24           45

This is the best in terms of performance. It requires no JOINs. If the number of fields is deterministic and not arbitrary (example each person has at most two test scores) then this is better albeit more rigid if you decide to associate more than two test scores to a person. Since SN is unique here for each row, the database engine can return as soon as it finds a match which is another reason why performance is better.

Design 2

SN     Test      Max    Min    Mean     
2093    1        23     2      15       
2093    2        54     -24     45

This is useful if SN 2093 can have N tests in their profile. Similarly if the number of tests is say 10m, then too this design is better than to have 30 columns. Each query and comparison will be quite heavy. This is also useful if your application requires queries where it wants to get the best performing test for student 2093 or if wishes do some analytics and reporting around test scores. This is more flexible though slightly slower than the previous one. I prefer this because I have a hunch that you will probably be interested in test statistics and students can have more than two tests each.

Design 3

SN     Test    Statistic    Value
2093    1        Max          23
2093    1        Min          2
2093    1        Mean         15       
2093    2        Max          54
2093    2        Min         -24
2093    2        Mean         45

This is useful if your queries were interested in values more than anything. For example if you are interested in how many values were greater than 80 this would be fast. In your scenario this doesn't make sense. You'll end up doing too many self JOINs. Reads will be slow! However, writes will probably be faster because you can quickly UPDATE max score for SN 2093 and Test 2 (assuming that the Statistic column is an enum instead of a string because string comparisons can be costly).

Design 4

SN      AverageCurrent (mA)    BatteryCapacity (mA)  
2093          200                    540

Design 5

SN      mA_Measuremnt       Value
2093    AverageCurrent      200 
2093    BatteryCapacity     540

Same arguments apply. It really depends on whether you intend to optimize for reads or writes? For web applications for example, if you can get away with it, I prefer Design 1. For example, I will usually know that a user will only have at most 3 phone numbers so I will make them each a field within the user column and avoid JOINs. The reads are fast even though writes will require setting some fields to null.

aleemb