ansaurus

Question

Optimized Table structure for Tags table.

Answer 1

+4 A:

Using TagText as the primary key would have the advantage that you could get an articles tags with fewer joins:

SELECT * FROM Article_Tag WHERE Article_ID = ?

It would have the disadvantage that tag strings take more space than integers, so the storage for Article_Tag and its indexes would be larger. This occupies more space on disk and also requires more memory for caching the index.

Bill Karwin 2009-01-11 22:45:48

Answer 2

+5 A:

It depends if you ever want to be able to change the tag-text globally. You could, of course, issue a wide UPDATE on Article_Tag, but if you need to be able to do this, then being able to just update the value in Tag would be simpler. Some servers offer automatic updates (such as ON UPDATE CASCADE in SQL Server), but these aren't necessarily cheap (it still has to UPDATE a lot of rows and any indexes defined).

But if you don't need this, then it should be a bit quicker with the literal in Article_Tag, as it can remove a join - a lot of times. Obviously, index it etc.

The additional space required for the repeated literal is a factor, but disk space is usually cheaper than a faster server.

As for being a primary key; unless you have other data to store, why would you even need the table any more? You could use DISTINCT on Article_Tag just as easily, especially as Tag should be indexed (so this should be pretty cheap). (edit Bill Karwin correctly points out the merits of being able to keep eligible tags, not just current tags).

Marc Gravell 2009-01-11 22:46:22

A table for Tags is useful if you do want to change tag names globally using a cascading update. Also SELECT DISTINCT on Article_Tags wouldn't get you a list of eligible tags, only tags currently in use.

Bill Karwin 2009-01-11 22:52:48

@Bill - fair enough; then keep `Tag` and use an `ON UPDATE CASCADE` (or the equivalent for your server) ;-p

Marc Gravell 2009-01-11 22:56:37

i only want tags in use

Shawn Simon 2009-01-11 22:58:41

@Shawn - fair enough, then as far as I can see, you don't really need the `Tag` table...

Marc Gravell 2009-01-11 23:00:32

@Bill - an `UPDATE` on `Tag` (with cascade) would do the same amount of work as an `UPDATE` issued on `Article_Tag`, and the syntax would be virtually identical. Not sure that (alone) is a reason to choose a `Tag` table.

Marc Gravell 2009-01-11 23:01:47

seems like the queries would be slow then, such as: select articleid from article a inner join article_tags at on a.articleid = at.articleid and at.tag = 'books'

Shawn Simon 2009-01-11 23:02:09

@Shawn: Why would that be slow? or rather: why would it be slower than the same with a second join? Any decent optimiser should sport the index on Article_Tag.Tag, do a seek to that value, then join those Article_ID to the Article table.

Marc Gravell 2009-01-11 23:18:19

@Shawn: for maximum performance, perhaps consider a spanning index on {tag,article_id} **in that order**, then it can avoid a bookmark-lookup and just use the index.

Marc Gravell 2009-01-11 23:19:17

Answer 3

A:

Welcome back, modesty. Or Localghost, or Shawn or however you call yourself now. Just keep in mind that there is no Hackers badge anymore, so nothing to win here :)

Michael Stum 2009-01-27 23:05:03

Answer 4

+2 A:

I'd go with 1 every time. It's fully normalised and since you're using a synthetic PK, you can change the name of a tag with a single row update.

The only advantage otherwise is reducing the number of joins. This is an optimisation, which we all know you should only make after measurement. If you were sure structure 1 wasn't fast enough you wouldn't be asking, right?

Now there isn't a whole lot of difference between 2 and 3, but as Bill Karwin notes, 3 has advantages in terms of cascading updates. More to the point the extra table loses you nothing.

So I would say go with 1. If there is a measurable (ie Provable) performance problem then 3 would be perfectly acceptable. It wouldn't be very hard to migrate later anyway.

Draemon 2009-01-28 04:37:40

Answer 5

A:

You should be mapping the TagText to TagId in code (and mapping to an in-memory cache anyway) and passing the pre-mapped TagId into your query.

Also there's no reason you need a synthetic key for the Article_Tag table. You should be using a composite-primary-key (ArticleId, TagId).

So, I say #1 with minor tweak mentioned above.

IDisposable 2009-01-28 04:53:35

Answer 6

A:

I'd go for Structure 2, maybe calling Article_Tag table simply - Tags.

dmajkic 2009-01-29 15:41:00

Answer 7

A:

A table with an AUTO_INCREMENT PK won't scale. Forget about TagID as INTEGER, and replace its type as BINARY(16), just enough for the MD5 checksum of TagText.

And with a proper cache layer, your SQL query won't need the TagText column as much as it needed.

yogman 2009-02-03 05:51:35

ansaurus

tags:

views:

answers:

Optimized Table structure for Tags table.

Structure 1 - TagID as int with a join table

Structure 2 - Tags only in Join table as string

Structure 3 - Tag as text as the PK

related questions