tags:

views:

859

answers:

7

Consider these 3 table structures. Which will perform these queries the best.

Structure 1 - TagID as int with a join table

Article
-------
ArticleID int

Article_Tag
------------
ArticleTagID int
ArticleID int
TagID int

Tag
---
TagID int
TagText varchar(50)

Structure 2 - Tags only in Join table as string

Article
-------
articleID int

Article_Tag
-----------
articleTagID int
articleID int
tag varchar(50)

Structure 3 - Tag as text as the PK

Article
-------
ArticleID int

Article_Tag
------------
ArticleTagID int
ArticleID int
Tag varchar(50)

Tag
---
Tag varchar(50)

Sample queries:

Select articleID from Article a inner join Article_tag at on a.articleID = at.articleID and tag = 'apple'
Select tag from Tags -- or, for structure 2
Select distinct tag from article_tag
+4  A: 

Using TagText as the primary key would have the advantage that you could get an articles tags with fewer joins:

SELECT * FROM Article_Tag WHERE Article_ID = ?

It would have the disadvantage that tag strings take more space than integers, so the storage for Article_Tag and its indexes would be larger. This occupies more space on disk and also requires more memory for caching the index.

Bill Karwin
+5  A: 

It depends if you ever want to be able to change the tag-text globally. You could, of course, issue a wide UPDATE on Article_Tag, but if you need to be able to do this, then being able to just update the value in Tag would be simpler. Some servers offer automatic updates (such as ON UPDATE CASCADE in SQL Server), but these aren't necessarily cheap (it still has to UPDATE a lot of rows and any indexes defined).

But if you don't need this, then it should be a bit quicker with the literal in Article_Tag, as it can remove a join - a lot of times. Obviously, index it etc.

The additional space required for the repeated literal is a factor, but disk space is usually cheaper than a faster server.

As for being a primary key; unless you have other data to store, why would you even need the table any more? You could use DISTINCT on Article_Tag just as easily, especially as Tag should be indexed (so this should be pretty cheap). (edit Bill Karwin correctly points out the merits of being able to keep eligible tags, not just current tags).

Marc Gravell
A table for Tags is useful if you do want to change tag names globally using a cascading update. Also SELECT DISTINCT on Article_Tags wouldn't get you a list of eligible tags, only tags currently in use.
Bill Karwin
@Bill - fair enough; then keep `Tag` and use an `ON UPDATE CASCADE` (or the equivalent for your server) ;-p
Marc Gravell
i only want tags in use
Shawn Simon
@Shawn - fair enough, then as far as I can see, you don't really need the `Tag` table...
Marc Gravell
@Bill - an `UPDATE` on `Tag` (with cascade) would do the same amount of work as an `UPDATE` issued on `Article_Tag`, and the syntax would be virtually identical. Not sure that (alone) is a reason to choose a `Tag` table.
Marc Gravell
seems like the queries would be slow then, such as: select articleid from article a inner join article_tags at on a.articleid = at.articleid and at.tag = 'books'
Shawn Simon
@Shawn: Why would that be slow? or rather: why would it be slower than the same with a second join? Any decent optimiser should sport the index on Article_Tag.Tag, do a seek to that value, then join those Article_ID to the Article table.
Marc Gravell
@Shawn: for maximum performance, perhaps consider a spanning index on {tag,article_id} **in that order**, then it can avoid a bookmark-lookup and just use the index.
Marc Gravell
A: 

Welcome back, modesty. Or Localghost, or Shawn or however you call yourself now. Just keep in mind that there is no Hackers badge anymore, so nothing to win here :)

Michael Stum
+2  A: 

I'd go with 1 every time. It's fully normalised and since you're using a synthetic PK, you can change the name of a tag with a single row update.

The only advantage otherwise is reducing the number of joins. This is an optimisation, which we all know you should only make after measurement. If you were sure structure 1 wasn't fast enough you wouldn't be asking, right?

Now there isn't a whole lot of difference between 2 and 3, but as Bill Karwin notes, 3 has advantages in terms of cascading updates. More to the point the extra table loses you nothing.

So I would say go with 1. If there is a measurable (ie Provable) performance problem then 3 would be perfectly acceptable. It wouldn't be very hard to migrate later anyway.

Draemon
A: 

You should be mapping the TagText to TagId in code (and mapping to an in-memory cache anyway) and passing the pre-mapped TagId into your query.

Also there's no reason you need a synthetic key for the Article_Tag table. You should be using a composite-primary-key (ArticleId, TagId).

So, I say #1 with minor tweak mentioned above.

IDisposable
A: 

I'd go for Structure 2, maybe calling Article_Tag table simply - Tags.

dmajkic
A: 

A table with an AUTO_INCREMENT PK won't scale. Forget about TagID as INTEGER, and replace its type as BINARY(16), just enough for the MD5 checksum of TagText.

And with a proper cache layer, your SQL query won't need the TagText column as much as it needed.

yogman