ansaurus

Question

is my data normalized?

Answer 1

+3 A:

You should name ids after their tables, e.g., question_id, media_id, etc. (In their primary key locations, that is, not just when they are used as foreign keys.) This makes joins easier, since you can say things like FROM question JOIN MEDIA USING (question_id)).
The question table is missing a subtopic_id.
You may want to replace level with level_id, having a foreign key on a LEVEL table, but that's not a hard-and-fast rule, just a suggestion. It's more flexible, but also more work; it ultimately comes down to what makes sense for your problem domain.

Other than those minor errata, it looks fine to me.

Marcelo Cantos 2010-10-31 04:24:23

Answer 2

+4 A:

Since Question belongs to a subtopic, the Question table should have a foreign key to the subtopic id

pinkfloydx33 2010-10-31 04:24:35

Answer 3

+4 A:

Sure, this is normalized and looks fine given your requirements. However, you may have left off the subtopic_id on the question table.

BobbyShaftoe 2010-10-31 04:25:39

Answer 4

+6 A:

Great question, and that you ask it before building the app speaks volumes for your ability to learn and improve your data modelling skills.

Limitations of Your Model

This is normalised in the simple sense: you've already decided what the entities are; what relations you need; you've even decided that they are meaningless things that have no identifying characteristics and you've given them IDENTITY columns as physical keys. Then you've normalised the tables.

That has nothing to do with the Relational Model or Normalisation.

It is typical of thinking in terms of the app (understandable, because that's all you have, and you know the functional needs). But that has nothing to do with the data. if you create it from that perspective, you will not get a database, you will get a storage location for the app. Now if the app was to be written for one quiz and thrown away, that would be acceptable. But if the quiz is going to be run for a month or two; if the company is still going to exist in ten years; if the quizzes are going to grow; if the stats that are going to be run, are going to grow, improve and mature ... well you will have quite a few problems. Every times you enhance the app, you will have to replace its "database". At some point they will decide that the app (and the "database" contained within it) is too expensive to change, and ditch both of them.

Even for the short duration that the app exists, because the collection of data is not normalised, it is very limited, the "power" of the RM, that which is implemented in every commercial RDBMS, is lost to you. Navigation (which is cumbersome anyway, due to SQL joins being cumbersome) is even more cumbersome. Eg. to get from a child to a grandparent, you are forced to get the parent, even though to are not getting any columns from it. User can only access the data via the app, it is "closed". These days there are thousands of report tools that connect to the database and allow the users to perform all kinds of queries, which have not been identified to you yet, without having to wait for a change request to be implemented by the app team: that capability is lost. And you will have to "re-factor" the data heap every year or so.

Nomalisation

Nomalisation produces Open databases, that allow such access without limitation.

Data lives much longer than the app. It survives the app, and has a value to the org. When the app gets replaced, they will extract the data and want the replacement app to be ready with the data loaded. If they did not use a data modeller, new app designer did not learn any lessons from the old app designers mistakes, and the whole repeats itself.

Ok, so the org expects to exist next year. Ok, so you are going to run more than one quiz in production. Ok, so you've hired a data modeller to run his eye past your collection of data, so that you don't make those mistakes. Excellent. Thankyou. I accept.

Relational Modelling 203

Data is modelled (a) completely independent of the app and any functions that you might know about and (b) using completely different techniques, that app developers do not know about. So let's get started. This is not a formal lesson re Normalisation (that will take too long, and the theory will bore you to death); this is just looking over the modellers shoulder while he works. All you need to do is answer questions [every time I post a question mark, please, stop and answer, before moving on], honestly (not suggesting you are dishonest; just reinforcing that "I don't know" is quite acceptable because it identifies an area we have to work on; whereas the clear affirmative and negative allows us to avoid discussion). Also, forgive me, but I will assume the answers, just to avoid the otherwise back-and-forth delays; please point out any of my mistakes, and I will correct the model.

Ok, you have some Entities that you are clear about, so let's start with them, I think they look like this Scribble in the Sand. I am purposely not using standard symbols because I do not want those meanings drawn into it at this stage, because we have not achieved those meanings, and we don't want to convey false expectations to people (including ourselves) who understand standard symbols.

How are the rows in each Entity identified uniquely from every other row ? (This is important because it helps us verify that the Entity is in fact an Entity, and not Donald Duck, something not real.) From the columns provided, you would have (at least):
Question.Question
.
Media ? how exactly, am I going to ensure that there are not 42 images which are identical ? If I allow that, you will kick my rear end when our boss screams at you, and then he will kick it as well. IDENTITY won't help me, it will happily let me insert duplicates. FileName is a good identifier. Media.FileName
.
Topic ? would you like 101 identical Topics ? I thought not.
Topic.Topic
.
Subtopic.Subtopic ? No. IDENTITY ? No. You are pretty clear it belongs under Topic, and I am already carrying Topic as a Foreignkey, which has meaning; if it is a dependent child, not an independent orphan, then the FK is an Identifying Relation:
Subtopic: (Topic, Subtopic).
.
Ok, we have fuselage.
How are the Entities related to each other ?
You are clear that Subtopic is a child of Topic, fine.
.
Question is a child of Subtopic? I think not. Do you really want the same Question "how many cars does the household own" which may exist in 42 quizzes, inserted in 42 rows ? "there will never be the same question in two quizzes" is suspicious, so please don't say it, unless you are going to put it in writing. What's worse is, let's say the question is a very picturesque one, then we will have 42 Media entries as well. If the boss gets upset at having to change the same question in 42 places, and when he does that he finds some of the images were ancient, they were not updated when the major image update took place last year ... it's your rear end.
.
How about we allow for the possibility of a Question existing in more than one Quiz ?
.
So Question is Independent.
Back to Entities. In our discussions, starting with the fifth word, you keep saying "Quiz", and I keep saying it, but we don't have an Entity for it. I can't imagine a loose and duplicated set of Questions, that I've just de-duplicated, without identifying said collection of Questions. Now you may say "actually the Quiz is the Topic", but that is very restrictive: every time the boss needs to add a quiz for some new customer, he has to add the whole Topic/Subtopic/Question set all over again, even when he knows he has entered those exact Questions before, in existing Quizzes; which is why he won the new business, and why he was going to finally make a profit, which you've just reduced to zero. My rear end is still hurting from his last upset, so let's make sure he does not have another excuse. This let's him grow, nurture, and change Quizzes and Topic/Subtopics independently, without duplication of Questions: .
Entity. Quiz
Key ? Well, it better have an index on it to ensure we do not supply him with duplicate Quizzes. Let's say that we don't want to carry CHAR(80) keys into the children, and the user think a ShortName CHAR(12) that they get to make up is a nice identifier, better than a 10 digit number. Quiz: QuizCode . ok, we have fuselage big enough for payload. So far we have this Improved Scribble in the Sand.
What Entities are truly Independent, that can exist without the existence of other Entities; and the remainder, which exists only in the context of another Entity ?
Media is still a bit up in the air. While we have excluded duplicate FileNames, we have not allowed for the same Image being used in more than one Question. Let's allow that. We do not want an IDENTITY, we are already carrying the Question PK, so it is (a) Dependent and (b) Identified, by the parent.
.
Question PK as CHAR(255) was fine before, but carrying it as the FK in the child becomes silly. You like IDENTITY, and there are going to be millions of Questions. Fine. Question.QuestionId
.
Let's give these Independent Entities square corners, and leave the children Dependent Entities round. Now we have this almost, but not quite, Entity Relation Diagram.
Let's finish the Relations. So far we have:
A Quiz can have many Questions, and a Question can exist in many Quizzes.
Question is a child of Subtopic, but it is Independent.
Media exists only in the context of Question, as an option.
.
Since we live in an Universe that has some order, and our endeavour is a scientific one, let's give a child of it some order. I think we now qualify for a Logical Entity-Relation Diagram.
The solid lines are Identifying Relations; the broken lines aren't. Now, because we have applied some standard requirements to the Entities and Relations, we can use standard symbols. Feel free to point out my mistakes.

That's it, we are done. I hope I have conveyed that the exercise is back-and-forth, that's why it is called modelling. The astronauts have separate, private quarters, and they can communicate with each other.

------------------------------------------------------------------------------------------

What's that ? You want the Data Model as well ? Ok, give me five minutes. Please grab a coffee for yourself.

------------------------------------------------------------------------------------------

It is small enough that I may as well give you the physical Data Model. I use the IDEF1X methodology for modelling Realtional Databases, the Data Model uses that nomenclature: I think I have explained the symbols above, except for:

the Primary Key is above the line, in the sequence shown (unique of course)
and carried into the child as an Foreign Key (bold)
AK stands for Alternate Key, meaning an additional (to the PK) Unique Index
IE stands for Inversion Entry, meaning an additional Non-Unique Index.
Indentifying Relations are solid lines; strong keys, they have meaning
Non-Identifying Relations are broken lines; weak, less meaning

.

Many-to-many Relations are a logical concept, and drawn as such in the Logical Model. They are implemented as Associative tables in the physical model. I have supplied that. A Normalised database has more tables (nothing to be afraid of) but fewer columns per table, and no duplicate columns (no Update Anomalies).
Oh my God. The Topic and SubTopic columns are huge! We can't migrate those fat foreign keys into Question. [Discuss with The Business Users.] Ok, they say there will be only one hundred Topics and a few thousand Subtopics. No need for a NUMERIC(10,0). They want the full Topic and Subtopic in the drop-downs, and they agree it has to be unique, but an additional short CHAR(6) code would be nice.

See, it does go back-and-forth. Paper is cheap; discuss with anyone and everyone; improve, correct, change, modulate, ameliorate, without creating a single table or writing a line of code; at the you will have a model worth writing code for. Anything else is not. The only way to learn is by presenting something solid, and having it knocked back or changed; make all your mistakes on paper, not in the database.

Note that a Surrogate key is always an additional key, an additional index. They are never a substitute for the Key (which is what you had, and what Eddie was trying to get you to think about: you were not preventing duplicates, you merely had a meaningless key guaranteeing the rows were unique, like a spreadsheet; and a false sense of security). So we need to minimise them, not stamp them willy nilly on every table.

I hope I have shown above, simplistic rules such as "never use surrogates" and "always use surrogates" are too stupid to discuss. No, careful modelling means: understand and consider that they are additional, not in substitute of, the real Key. Use only when you have to, and when you have to, use them. Here I have managed to get away with just one IDENTITY. And the three short codes that are meaningful to the users are surrogates, but they have meaning; IDENTITY columns don't (they end up having meaning, which they can't support, which is part of the problem out there).

I will leave the DataTypes to you. But please keep in mind, varchars and Nullable columns force the column to be variable. Really slow if used in an Index (every entry has to have a bit of "unpacking" on every access, even the intermediate levels), so that must be avoided, unless you want to supply the boss with a slow database.

Likewise if you don't want the overhead of your rows on the page being moved on every UPDATE, fixed length columns all around. Which means we can't be lazy and varchar everything.

Ok, now we have a chamber to contain rocket fuel.

Response to Comments 1

from your last data model, would it be better if I omitted topicCode from the question table? DOesn't it feel little redundant to include topicCode in both the subtopic and question tables?

Excellent question.

(The last, fifth, is the Data Model; the fourth is the Entity Relation Diagram; the first three are pie in the sky, getting there.)

A Primary Key migrated as a Foreign Key in the child, is not redundant, it is required.
The Subtopic PK is (TopicCode, SubtopicCode), a Compound Key (which commercial databases support, as a requirement of the Relational Model). Above the horizontal line in Subtopic.
- Some developers are scared of Compound Keys because they require more than one reference in the WHERE clause (pure laziness; SQL is cumbersome with joins; deal with it).
  .
That is because the relation Topic::Subtopic is an Identifying relation, which means the PK of the Parent is used to construct the PK of the Child, forming a Compound Key. Note the Parent PK has to be carried in the Child anyway, as a FK, so it is not redundant; it is required. This is known to substantially increase the "power" or "relationality" of the db, and substantially increase ease of use (power users are often more adept in using a RDb than developers).
- That is why it is identified as the standard: IDEF1X is a standard that enforces a stricter application of the Relational Model; it promotes thoughtful consideration of all Keys, which of course is crucial to the "relationality" of a supposed Relational Database.
  .
In Question, the FK to Subtopic is therefore (TopicCode, SubtopicCode).
If you used IDs in Topic and SubTopic, then Question would have (SubtopicId) as the FK to Subtopic, and you lose both navigational power and meaning.
- One reason that using compound keys which identify the parent (and grandparent) is superior to using IDs is, eg. when you need columns from Question and Topic only, you do not have to SELECT from Subtopic, you can get TopicCode directly from Question; whereas with IDs, you are forced to SELECT from Subtopic, which is not part of the report requirement, it is an imposition due to poor relational modelling; failure to implement Identifying Relations, strong meaning. Just One Reason that data heaps that have IDs only as PKs (carried as FKs) perform much more work than Normalised Relational Databases.
  - Because the tables concerned are simple and small, this may not be a good example for demonstrating the superiority of Relational keys/limitation of IDENTITY columns; consider any large Grandparent::Parent::Child tables.
  - A better example for demonstrating this superiority/limitation, and worth consideration for anyone wishing to discuss this issue further, is the Data Model from my Advanced Course, because this issue is addressed explicitly.
- Another reason is lost meaning. Eg. an user can simply SELECT TopicCode, SubtopicCode, Question FROM Question, and make sense of result set.

Response to Comments 2

Progressed Data Model 101102

Response to Comments 3

I don't understand how you can use the same image file with different questions?

Simple eg. "name a flightless bird" and "what is an Emu" can both use the same photograph of an Emu. That was my assumption as per (1) in the main post above. The model allows it by providing a Non-Unique index on FileName.

Should you include Media.FileName foreign key in the Question table and remove QuestionId from the Media table?

Well, no. There are several advantages to the way I have modelled it. If it is in the Question table, then some large potion of the time, it will be Null. You need an index on Filename to search on them; see if it is used; or not; etc. Which means we cannot index the FileName column (or we can index it, but the index will be slow due to Nulls). Right now it is modelled as:

an "option" or Subtype of Question (one that has an image)
No Nulls
Fast Non-unique Index

It can be set up as an image bank (Unique index). We know a Question can have zero-to-one Images. You can tell us your decisions, and I will change the model:

Images are Unique re FileName; they can be used in only one Question
Images are Unique re FileName; they can be used in more than one Question

And what kind of relationship is that? 1::1?

Yes. With the Media end being "optonal". Precisely: one Question has zero-or-one Media. Hence "May Have" is the Verb Phrase or Title of the relation.

The model is drawn with The Natural hierarchy in mind: Parents are above children; subtypes are on the same horizontal.

Doesn't look like other n::n or 1::n relationships.

There are no n::n relations. (There was one at the Logical level; which was implemented as the Associative table at the physical level.)

*Why did you include the Answer field in the QuizQuestion table?

Where would you like to store the users' answers ?

See, you need the Quiz and QuizQuestion tables; they were valid entities before, you did not see that then, but you do now.

The relationship between topics and subtopics are defined in the Subtopic table. In the Question table, you have included TopicCode and SubtopicCode. You said (if I'm not mistaken) that including the TopicCode is good because I can get to the Topic directly from the Question table without joining the Subtopic table.

Yes. As per (1.Subtopic) above.

But, given your model, every time when I enter a new question to the DB, I should choose a topic and a subtopic [for the question, from the existing list].

Well, you have to do that anyway; the model just enforces it. I understood (from the data) that that is what you needed. One Suptopic can belong to more than one Topic. So you have to give it both a Topic and a Subtopic (the specific combination).

How does this database make sure that the subtopic belongs to a certain topic as described in the Subtopic table?

You don't see that ? Can you see that the Subtopic table already supplies specific [whatever you have inserted] Topic::Subtopic combinations ... and not others ? Ok, so when you add a new Question, and give it a Topic and Subtopic (which is the Primary Key of SubTopic, and the Foreign Key in Question), the database will enforce the FK such that only one of the combinations of Topic::Subtopics that exist in SubTopic can be used.

That's a small insight into the power of Good Natural Relational Keys.

That's the beauty of a good Relational model: it supplies many requirements from a single [correctly modelled] structure.

Response to Comments 4

Progressed Data Model 101103

PerformanceDBA 2010-10-31 06:13:52

Thank you for your answer. I can't correct your mistakes. I can barely understand what you've wrote. ;)

Baha 2010-10-31 15:08:42

Thanks for your honesty. I did not mean "correct", I meant "point them out" and I will correct them. Ok, check the Data Model (I did expand my post), and see if it works for you, if there is anything you think you can't do.

PerformanceDBA 2010-10-31 16:05:50

from your last data model, would it be better if I omitted topicCode from the question table? DOesn't it feel little redundant to include topicCode in both the subtopic and question tables?

Baha 2010-10-31 19:48:08

Answer 5

A:

Suggestion to naming conventions:

Primary key: someNameId Foreign key: someOtherName_Id

At the moment in few tables you have media_id and question_id. It's unclear what's what without looking at the data structure. Now imagine you have 80 tables and 20 to 40 attributes in each table. You'll get lost in where is a foreign key and where is a primary key.

vikp 2010-10-31 12:50:20

I disagree strongly. With one exception, the column name for the PK and the FK should always be **identical**: In the child, ParentKey has strong meaning. The only exception is where the role changes or where it is ambiguous. The PartId PK in the Assembly table, which has two PartId FKs, will be AssemblyId and ComponentId. Separately, if you have 80 tables with 20 to 40 columns in each, which of course means you have FKs all over the place, you have an un-normalised data heap, not a database. Now there, you would need help with all those duplicate FKs, but not in a database.

PerformanceDBA 2010-10-31 13:21:06

Clarity. media_id is always media i.d., and nothing but. If it is in media, then it is the PK. If it is anywhere else, it is an FK. What is unclear about that ? I do not need to see the Data Model, but I would, if it has been drawn is some logical manner, that conveys meaning and order, and not 80 tables squeezed into the rectangular frames of pages..

PerformanceDBA 2010-10-31 14:06:17

Give me an example of a header table with 5 or more foreign keys and I'll explain my point.

vikp 2010-10-31 15:22:41

@vikp: you made your point, you provide the example, and you explain it. I have very few such tables, and they are customers, not mine to flaunt. And they certainly do not have the problem that you allude to. Second, does that mean that tables with less than 5 FKs do not have your problem ? If so, that means most peoples' tables. You are trying to solve a problem that isn't there, that only you have (due to your unnormalised colloection o data). See, you really need to provide your example.

PerformanceDBA 2010-10-31 15:49:44

ansaurus

tags:

views:

answers:

is my data normalized?

related questions