views: 126
answers: 4
I'm working on a little piece of blog software, and I'd like to attach tags to posts. Each Post can have anywhere from zero to an unlimited number of Tags, and I wonder if it's possible to do that without having to join tables?

As the number of tags is not limited, I cannot just create n fields (Tag1 to TagN), so another approach (which is apparently the one Stack Overflow takes) is to use one large text field and a delimiter, e.g. "<Tag1><Tag2><Tag3>".

The problem there: if I want to display all posts with a given tag, I would have to use a "LIKE '%<Tag2>%'" statement, and as far as I know such leading-wildcard queries cannot use any indexes, requiring a full table scan.

Is there any suitable way to solve this?

Note: I know that a separate Tag-Link-Table offers benefits and that I should possibly not worry about performance without measuring, etc. I'm more interested in the different ways to design such a system.

+3  A: 

A separate tag table is really the only way to go here: it's the only design that allows an unlimited number of tags.

Nico Burns
+6  A: 

Wanting to do this without joins strikes me as premature optimisation. If this table is accessed frequently, its pages are very likely to be in memory, so you won't incur an I/O penalty reading from it, and the plans for the queries accessing it are likely to be cached.

Ken Keenan
Proper indexing would also help reduce any penalty from the join.
Corey Sunwold
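To illustrate the comment above, here is a minimal sketch of the indexed link table the join-based design implies (table, column, and index names are assumed for illustration, not taken from the question):

```sql
-- Link table between posts and tags:
CREATE TABLE PostTags (
    PostID INT NOT NULL,
    TagID  INT NOT NULL,
    PRIMARY KEY (PostID, TagID)
);
-- Index so "posts by tag" lookups avoid a full scan:
CREATE INDEX IX_PostTags_TagID ON PostTags (TagID);

-- All posts carrying a given tag:
SELECT p.*
FROM Posts p
JOIN PostTags pt ON pt.PostID = p.PostID
JOIN Tags t      ON t.TagID   = pt.TagID
WHERE t.TagName = 'architecture';
```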
A: 

If you're using SQL Server, you could use a single text field (varchar(max) seems appropriate) and full-text indexing. Then just do a full-text search for the tag you're looking for.
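A rough sketch of that setup on SQL Server follows; the catalog name, column name, and key index name are placeholders (SQL Server requires a unique key index on the table before a full-text index can be created):

```sql
-- Full-text catalog and index on the delimited tag column (names assumed):
CREATE FULLTEXT CATALOG TagCatalog;
CREATE FULLTEXT INDEX ON Posts (TagString)
    KEY INDEX PK_Posts ON TagCatalog;

-- Find posts whose TagString contains a given tag:
SELECT PostID, PostTitle
FROM Posts
WHERE CONTAINS(TagString, '"Tag2"');
```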

David Archer
+2  A: 

This sounds like an exercise in denormalization. All that's really needed is a table that can naturally support any query you happen to have, by repeating any information you would otherwise have to join to another table to obtain. A normalized database for something like what you've got might look like:

Posts:
PostID  | PostTitle    | PostBody          | PostAuthor
--------+--------------+-------------------+-------------
1146044 | Join-Free... | I'm working on... | Michael Stum

Tags:
TagID | TagName
------+-------------
1     | Architecture

PostTags:
PostID  | TagID
--------+------
1146044 | 1

Then you can add columns to optimise your queries. If it were me, I'd probably leave the Posts and Tags tables alone and add the extra info to the PostTags join table. What I add might depend a bit on the queries I intend to run, but I'd probably at least add Posts.PostTitle, Posts.PostAuthor, and Tags.TagName, so that showing a blog post needs only two queries:

SELECT * FROM `Posts` WHERE `Posts`.`PostID` = $1 
SELECT * FROM `PostTags` WHERE `PostTags`.`PostID` = $1

And summarizing all the posts for a given tag requires even less:

SELECT * FROM `PostTags` WHERE `PostTags`.`TagName` = $1
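The denormalised PostTags table described above might be declared like this (a sketch; column types and the index name are assumed):

```sql
CREATE TABLE PostTags (
    PostID     INT          NOT NULL,
    TagID      INT          NOT NULL,
    -- Denormalised copies, so tag pages need no join:
    PostTitle  VARCHAR(255) NOT NULL,
    PostAuthor VARCHAR(100) NOT NULL,
    TagName    VARCHAR(50)  NOT NULL,
    PRIMARY KEY (PostID, TagID)
);
CREATE INDEX IX_PostTags_TagName ON PostTags (TagName);
```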

Obviously the downside to denormalization is that you have to do a bit more work to keep the denormalized tables up to date. A typical way of dealing with this is to put sanity checks in your code that detect when a denormalized result is out of sync, by comparing it to other information the code already has available. In the example above, such a check could compare the post titles in the PostTags result set against the title in the Posts result; this doesn't cost an extra query. If there's a mismatch, the program could notify an admin, e.g. by logging the inconsistency or sending an email.
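The same drift could also be surfaced directly in SQL with an audit query (a sketch, not from the answer, using the table names above):

```sql
-- Rows where the denormalised title has drifted from the source of truth:
SELECT pt.PostID,
       pt.PostTitle AS Denormalised,
       p.PostTitle  AS Canonical
FROM PostTags pt
JOIN Posts p ON p.PostID = pt.PostID
WHERE pt.PostTitle <> p.PostTitle;
```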

Fixing it is easy (though costly in terms of server workload): throw out the extra columns and regenerate them from the normalized tables. Obviously you shouldn't do this until you have found the cause of the database going out of sync.
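Regenerating the denormalised columns from the normalized tables could look something like the following sketch (MySQL-style multi-table UPDATE shown; the exact syntax varies by database engine):

```sql
UPDATE PostTags pt
JOIN Posts p ON p.PostID = pt.PostID
JOIN Tags  t ON t.TagID  = pt.TagID
SET pt.PostTitle  = p.PostTitle,
    pt.PostAuthor = p.PostAuthor,
    pt.TagName    = t.TagName;
```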

TokenMacGuy