views:

251

answers:

5

hi all,

while trying to figure out how to tag a blog post with a single sql statement here, the following thought crossed my mind: using a relation table tag2post that references tags by id as follows just isn't necessary:

tags
+-------+-----------+
| tagid | tag       |
+-------+-----------+
|     1 | news      | 
|     2 | top-story | 
+-------+-----------+

tag2post
+----+--------+-------+
| id | postid | tagid |     
+----+--------+-------+
|  0 |    322 |     1 |
+----+--------+-------+

why not just using the following model, where you index the tag itself as follows? taken that tags are never renamed, but added and removed, this could make sense, right? what do you think?

tag2post
+----+--------+-------+
| id | postid | tag   |     
+----+--------+-------+
|  1 |    322 | sun   |
+----+--------+-------+
|  2 |    322 | moon  |
+----+--------+-------+
|  3 |   4443 | sun   |
+----+--------+-------+
|  4 |   2567 | love  |
+----+--------+-------+

PS: i keep an id, i order to easily display the last n tags added...

+2  A: 

This sounds fine to me, using an ID to reference something that you delegated into another table makes sense when you have things that vary, say a user's name or whatever, because you don't want to change it's name in every place in your database when he changes it. However in this case the tag names themselves will not vary, so the only potential downside I see is that a text index might be slightly slower than a numeric index to search through.

Seldaek
+7  A: 

It works, but it is not normalized, because you have redundancy in the tags. You also lose the ability to use the "same" tags to tag things besides posts. For small N, optimization doesn't matter, so I have no problems if you run with it.

As a practical matter, your indexes will be larger (assuming you are going to index on tag for searching, you are now indexing duplicates and indexing strings). In the normalized version, the index on the tags table will be smaller, will not have duplicates, and the index on the tag2post table on tagid will be smaller. In addition, the fixed size int columns are very efficient for indexing and you might also avoid some fragmentation depending on your clustering choices.

I know you said no renaming, but in general, in both cases, you might still need to think about the semantics of what it means to rename (or even delete) a tag - do all entries need to be changed, or does the tag get split in some way. Because this is a batch operation in a transaction in the worst case (all the tag2post have to be renamed), I don't really classify it as significant from a design point of view.

Cade Roux
in addition they have to be reindexed on a rename
Shawn Simon
A: 

I was considering that too. Want a list of tags in the database, just select distinct tag from tag2post. I was told that since I wanted to optimize for select statements, it would be better to use an integer key because it was much faster than using a string.

Shawn Simon
+2  A: 

Where is the real advantage of your proposal over a relation table containing IDs?

Technically they solve the same problem, but your proposed solution does it in a redundant, de-normalized way that only seems to satisfy the instinctive urge to be able to read the data directly from the relation table.

The DB server is pretty good at joining tables, and even more so if the join is over an INT field with an index on it. I don't think you will be facing devastating performance issues when you join another table (like: INT id, VARCHAR(50) TagName) to your query.

But you lose the ability to easily rename a tag (even if you don't plan on doing so), and you needlessly inflate your relation table with redundant data. Over time, this may cost you more performance than the normalized solution.

Tomalak
+2  A: 

The de-normalised method may be fine depending on your application. You may find that it causes a performance hit due to searching a large set of VARCHAR data.

When doing a search for things tagged like "sun*" (e.g. sun, sunny, sunrise) you will not need to do a join. However, you will need to do a like comparison on a MUCH larger set of VARCHAR data. Proper indexing may alleviate this issue but only testing will tell you which method is faster with your dataset.

You also have the option of adding a VIEW that pre-joins the normalised tables. This gives you simpler queries while still allowing you to have highly normalised data.

My recommendation is to go with a normalised structure (and add de-normalised views a necessary for ease of use) until you encounter an issue that de-normalising the data schema fixes.

Chris Nava