tags:

views:

178

answers:

5

I have the option of writing two different formats for a database structure:

Article
-------
ArticleID int FK

Article_Tags
------------
ArticleTagID int FK
ArticleID int FK
TagText varchar(50)

or

Article
-------
ArticleID int PK

Article_Tags
------------
ArticleTagID int PK
ArticleID int FK
TagText varchar(50) FK

Tag
---
TagText varchar(50) PK

If I want a list of all tags in the database, I could use:

select distinct tagtext from article_tags

or:

select tagtext from tag

The first situation is pretty easy. Maybe it would speed up if i indexed it properly. The second is a little harder because I have to constantly delete tags with no joins. Imagine a pretty large system, which would be better.

+1  A: 

Option 2 with some modifications. Assuming you're limiting input to the values contained in the tag table you're then able to perform two different queries for different purposes.

The first query will give you a unique list of all currently USED tags.

The second query will give you all possible tags that can be used, including those that have not yet been used.

You would be better to use a tag ID and set up a relationship between the tag and article_tags tables. This would make option two more efficient as the index would simply be an integer rather than a string.

nullnvoid
integers are more efficient than strings?
Shawn Simon
Generally. Depending on the size.
nullnvoid
currently i have the second situation but using TagID int and Tagtext, just annoying having to delete any used tags after each query.
Shawn Simon
+4  A: 

I would go with whichever solution will result in the best/cleanest design for your application. If you need to attach data directly to a tag then a separate table (i.e. a more normalised solution) would be the correct one.

I would warn against worrying too much about the performance difference in the 2 proposed solutions, if indexed, the difference is likely to be insignificant (both are fairly common use cases and can be easily optimised using standard db techniques). Making a decision between the 2 presented options on the basis of performance sounds like a premature optimisation.

Michael Barker
+1  A: 

I would go with

Article
-------
ArticleID int PK

Article_Tags
------------
ArticleTagID int PK
ArticleID int FK
TagId int FK

Tag
---
TagId int identity(1,1) PK
TagText varchar(50)

There really is no reason to denormalize this from get go. (your first and second versions are not normalized)

Having tags in a separate table is fine, and getting the unique tags is cleaner without that distinct statement.

Performance wise if any of the 3 options are going to perform pretty much the same provided the correct indexing is applied.

Sam Saffron
are you sure about the second one not being normalized?
Shawn Simon
+1  A: 

To answer the basic question from the title: a DISTINCT usually means sorting the data. Depending on indices, the structure of the query and the amount of data returned this might be gratis (proper index on tagtext, ORDER BY tagtext, small return set) or not (missing index, order irrelevant, massive return set).

David Schmitt
Wrong, no sort as long as the obvious indexes are supplied. And DISTINCT would be very efficient - dbms's know how to skip from the start of one value to the next.
le dorfier
That's why I said "Depending on the indices [...] this might be gratis (proper index on tagtext [...]"
David Schmitt
+1  A: 

The overhead of maintaining two tables would be very unlikely to be a reasonable tradeoff for messing up a simple design.

It would be simple enough to test it both ways if you're really worried about it, but from my extensive experience there's no question this is true.

le dorfier
i will probably execute the update/insert queries on a separate thread anyway since theyre not mission critical
Shawn Simon