tags:

views:

83

answers:

4

Related

Using SO as an example, what is the most sensible way to manage tags if you anticipate they will change often?

Way 1: Seriously denormalized (comma delimited)

table posts
+--------+-----------------+ 
| postId | tags            |
+--------+-----------------+
|   1    | c++,search,code |

Here tags are comma delimited.

Pros: Tags are retrieved at once with a single select query. Updating tags is simple. Easy and cheap to update.

Cons: Extra parsing on tag retrieval, difficult to count how many posts use which tags.

(alternatively, if limited to something like 5 tags)

table posts
+--------+-------+-------+-------+-------+-------+
| postId | tag_1 | tag_2 | tag_3 | tag_4 | tag_5 |
+--------+-------+-------+-------+-------+-------+
|   1    | c++   |search | code  |       |       | 

Way 2: "Slightly normalized" (separate table, no intersection)

table posts
+--------+-------------------+
| postId | title             |
+--------+-------------------+
|   1    | How do u tag?     |

table taggings
+--------+---------+
| postId | tagName |
+--------+---------+
|   1    | C++     |
|   1    | search  |

Pros: Easy to see tag counts (count(*) from taggings where tagName='C++').

Cons: tagName will likely be repeated many, many times.

Way 3: The cool kid's (normalized with intersection table)

table posts
+--------+---------------------------------------+
| postId | title                                 |
+--------+---------------------------------------+
|   1    | Why is a raven like a writing desk?   |

table tags
+--------+---------+
| tagId  | tagName |
+--------+---------+
|   1    | C++     |
|   2    | search  |
|   3    | foofle  |

table taggings
+--------+---------+
| postId | tagId   |
+--------+---------+
|   1    | 1       |
|   1    | 2       |
|   1    | 3       |

Pros:

  • No repeating tag names.
  • More girls will like you.

Cons: More expensive to change tags than way #1.

+4  A: 

These solutions are called mysqlicious, scuttle and toxi.

This article compares benefits and drawbacks of each.

Quassnoi
Ah yes, I just found this. Very helpful!
bobobobo
A: 

I personally favour solution #3.

I don't agree that solution #1 is easier to mantain. Think of the situation where you have to change the name of a tag.

Solution #1:

UPDATE posts SET tag = REPLACE(tag, "oldname", "newname") WHERE tag LIKE("%oldname%")

Solution #3:

UPDATE tags SET tag = "newname" WHERE tag = "oldname" 

The first one is way heavier.

Also you have to deal with the commas when deleting tags (OK, it's easily done but still, more difficult that just deleting one line in the taggings table)

As for solution #2... is neither fish nor fowl

nico
+1  A: 

I would argue that there is a fourth solution which is a variation on your third solution:

Create Table Posts
(
    id ...
    , title ...
)
Create Table Tags
(
    name varchar(30) not null primary key
    , ...
)

Create Table PostTags
(
    PostId ...
    , TagName varchar(30) not null
    , Constraint FK_PostTags_Posts
        Foreign Key ( PostId )
        References Posts( Id )
    , Constraint FK_PostTags_Tags
        Foreign Key ( TagName )
        References Tags( Name )
        On Update Cascade
        On Delete Cascade
)

Notice that I'm using the tag name as the primary key of the Tags table. In this way, you can filter on certain tags without the extra join to the Tags table itself. In addition, if you change a tag name, it will update the names in the PostTags table. If changing a tag name is a rare occurrence, then this shouldn't be a problem. If changing a tag name is a common occurrence, then I would go with your third solution where you use a surrogate key to reference the tag.

Thomas
@Thomas: this is same as `scuttle`, only much harder to manage. Possibility to write `UPDATE Tags SET TagName = 'newtag' WHERE TagName = 'oldtag'` instead of `UPDATE PostTags SET TagName = 'newtag' WHERE TagName = 'oldtag'` is not really worth it.
Quassnoi
@Quassnoi - With Cascade Update you need only write `Update Tags Set Name = 'NewName' Where Name = 'OldName'`. No harder to manage that if you use a surrogate key. The real question is whether the benefit of avoiding the extra join outweighs the frequency with which you are altering an existing tag name. Since I would assume the later is infrequent, the performance benefit is probably worth it.
Thomas
@Thomas: again, how is it different from way 2 (`scuttle`) except having an extra table (`Tags`) which serves no purpose?
Quassnoi
@Quassnoi - The design is closer to Toxi than scuttle. The Tags table serves a very important purpose. It ensures that you cannot add a tag reference in PostTags that you not already have in the Tags table. I.e., you still maintain a single list of tags. In addition, it allows you add additional attributes on the tags just as you with with the Toxi setup. The difference is that if you want a list of posts for a given tag, you avoid additional work of *having* to join to the Tags table (you still can of course). With thousands of rows that will make no difference but on millions it will.
Thomas
@Quassnoi - Keep in mind that on the Tags table in the Toxi setup, you are still going to want a unique constraint on the Name of the tag anyway and it will likely be non-nullable. You might as well use it as the primary key in that case *unless* you knew you would have *many* calls to alter a tag name which I would think is unlikely.
Thomas
@Thomas: "cannot add a tag reference in PostTags that you not already have in the Tags table" is exactly what I meant by "harder to manage". "Avoid additional work of having to join to the Tags table" — how is it a difference? In `scuttle`, you don't have to join to the `Tags` table too (you don't have a `Tags` table, to begin with).
Quassnoi
@Thomas: I didn't say anything about `toxi` setup at all.
Quassnoi
@Quassnoi - In scuttle, how do you ensure a single unique list of tags? A unique constraint on category in scCategories? That prevents you from associating a tag more than once. If the UC is on bid and category, then one post can have `tsq1` and another can have `t-sql`.
Thomas
@Thomas: what is "ensure single unique list of tags"? You mean, like, create a list of all possible tags?
Quassnoi
@Quassnoi - NM. In the link's example, bookmarks would equate to Posts in our example.
Thomas
@Quassnoi - I suppose the idea behind scuttle is a loose association of tags to each other instead of a master list. The major downside is that there is no means to easily add attributes to a tag. E.g., in scuttle if you had 10 posts that used `t-sql` and you wanted to add a description for `t-sql`, there is no means to do that without duplication.
Thomas
@Thomas: sure, but I can hardly imagine why on Earth would you need description for a tag.
Quassnoi
@Quassnoi - I can. A friendly description about what the tag represents. Perhaps you want to deactivate a tag and not allow it to be used. Perhaps one day you want to allow for a hierarchy of tags and thus want a "ParentTag" attribute. It depends on how much control over the tags you want. If they are just fancy and arbitrary keywords, then scuttle might be a better approach.
Thomas
@Quassnoi - Hell, I wish SO would include a description that prodded users against choosing just the `sql` tag instead of a product/version specific tag. Hard to do with the scuttle approach.
Thomas
@Thomas: your table would come handy then, but without the referential constraint. I believe you still want to be able to add the tags even if they have not been described yet.
Quassnoi
and a +1, then :)
Quassnoi
A: 

I think that SO uses solution #1. I'd go with either #1 or #3.

One thing to consider is if you have several thing that you can tag (e.g. adding tags to both post and products, for example). This may affect database solution.

Juha Syrjälä