It's a question of normalization. IMHO one of the best books on this subject is Joe Celko's SQL for Smarties. Basically, you avoid what are called "anomalies". In your example, if I delete all the questions with the "Java" tag, I would never be able to know that I ever had a tag called "Java" (delete anomaly). It's also important to crack out the table because you need the xref table to describe properties of the relationship between the principals.
http://en.wikipedia.org/wiki/Database_normalization
it's not a problem for a computer, but RDBMS theory said, that db should by normalized reducing info duplication. Here's what Dr. Codd said about need for normalization:
- To free the collection of relations from undesirable insertion, update and deletion dependencies;
- To reduce the need for restructuring the collection of relations as new types of data are introduced, and thus increase the life span of application programs;
- To make the relational model more informative to users;
- To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
E.F. Codd, "Further Normalization of the Data Base Relational Model"
The issue is one of how normalized you want your table structure to be. Generally, you don't want to store information in more than one place. To that end, when data may be repeated for many items, you normalize it -- move that data to a separate table where multiple rows in the other table may reference it by storing the key of the data rather than the data itself. When you have many rows sharing the same data AND you want to normalize it, you need an intermediate table to store the relations (reference pairs) between the tables.
Usually it is a lot more information than just a tag column. So if it is a lot of information then you have redundant data (you have 2 "C" values in your example). Then if the same value lives in more than one place updates become a problem. So the rule is that the data should live in 1 place and its ID is used in other places to reference it. Then when you update it, it only needs to be done in one place.
In a relational database a many-many relationship is implemented as two reciprocal one-many relationships, each of which requires an additional table (beyond the tables directly representing the entities) to implement.
- First, a one-many relationship between a row in the first table to many rows in the second table.
- Second, another one-many relationship between a row of the second table to many rows in the first table.
The why of it has to do with the relational database model.
Just to add to what others say (I wont repeat their comments)
In my experience, it's not typically called a help table but a join table. Normally you're dealing with something more complicated than a simple keyword. The 'extra' table models the relationship between the 2 other entities.
Another example might be I have a marketing campaign that goes to many recipient contacts. Neither of these 2 entities is dependent on the other. Any particular campaign will have many contacts, and any contact may be sent more than one campaign. The join table in this case models the history of who was sent which campaign.
Campaign
- CampaignID (PK)
- other columns
Contact
- ContactID (PK)
- other columns
CampaignContact
- CampaignContactID (PK)
- CampaignID (FK)
- ContactID (FK)
This is quite different from the 1-many relationship (sometimes called a master-detail relationship). Here a canonical example is Invoice -> InvoiceItems. The invoice items link specifically to one and only one parent invoice.
Invoice
- InvoiceID (PK)
- other columns
InvoiceItem
- InvoiceItemID (PK)
- InvoiceID (FK)
- other columns